Re-targetable tools and methodologies for the efficient deployment of high-level source code on coarse-grained dynamically reconfigurable architectures by Muir, Mark I.R.
Re-Targetable Tools and Methodologies for the
Efficient Deployment of High-Level Source
Code on Coarse-Grained Dynamically
Reconfigurable Architectures
Mark I. R. Muir
A thesis submitted for the degree of Doctor of Philosophy.
The University of Edinburgh.
October 2009
Abstract
Reconfigurable computing traditionally consists of a data path machine (such as an FPGA)
acting as a co-processor to a conventional microprocessor. This involves partitioning the
application such that the data path intensive parts are implemented on the reconfigurable fabric, and
the control flow intensive parts are implemented on the microprocessor. Often the two parts
have to be written in different languages. New highly parallel data path architectures allow
parallelism approaching that of FPGAs, but are able to be reconfigured very rapidly. As a result, it
is possible to use these architectures to perform control flow in a manner similar to a
microprocessor, and thus a complete program can be described from an unmodified high-level language
(in particular C). This overcomes the historical instruction-level parallelism (ILP) wall.
Existing microprocessor tool flows are insufficient to make full use of the available parallelism.
Data path machines are typically programmed via HDL tools from the ASIC design world.
This expresses algorithms at a lower level than that at which application algorithms are typically
developed. The work in this thesis builds upon earlier work to allow applications to be described
from high-level languages, by employing low-level optimisations in the compiler back-end and
working from the assembly, to maximise parallel efficiency. This consists of scheduling, where
known techniques are used to pack instructions into basic blocks that map well to the
reconfigurable core (optimising spatial efficiency); then automatic pipelining is applied to dramatically
improve the achievable throughput (optimising temporal efficiency). Together these can be
thought of as “instruction-level parallelism done right”. Speed-ups of more than an order of
magnitude were achieved, yielding throughputs of 180-380 MPixels/s on typical image signal
processing tasks, matching the performance of hard-wired ASICs.
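As a rough illustration of why pipelining improves temporal efficiency, the sketch below greedily splits a linear chain of operation delays into stages that each fit within a target critical path, then compares throughput before and after. The delays, the target, and the names `allocate_stages` and `throughput_mhz` are hypothetical, and register overhead is ignored; this is a simplified model, not the pipeline stage allocation algorithm developed in the thesis.

```python
def allocate_stages(delays_ns, target_ns):
    """Greedily split a linear chain of operation delays into pipeline stages.

    Each stage's summed delay stays within target_ns. Assumes every single
    operation fits within the target (a hypothetical simplification).
    """
    stages, current, used = [], [], 0.0
    for d in delays_ns:
        if d > target_ns:
            raise ValueError("operation longer than target critical path")
        if current and used + d > target_ns:
            # Close the current stage and start a new one.
            stages.append(current)
            current, used = [], 0.0
        current.append(d)
        used += d
    if current:
        stages.append(current)
    return stages


def throughput_mhz(critical_path_ns):
    """One result per critical-path period, in millions of results per second."""
    return 1000.0 / critical_path_ns


# Hypothetical kernel: five chained operations with these delays (ns).
delays = [2.0, 3.0, 1.0, 4.0, 2.0]
stages = allocate_stages(delays, target_ns=5.0)

# Without pipelining, one result emerges per traversal of the whole chain.
unpipelined = throughput_mhz(sum(delays))
# With pipelining, the slowest stage sets the rate (register overhead ignored).
pipelined = throughput_mhz(max(sum(s) for s in stages))
print(stages, unpipelined, pipelined)
```

In this toy model the chain splits into three stages, and the achievable rate is set by the slowest stage rather than by the whole chain.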
Furthermore, conventional software-based simulation technologies for data path machines are
too slow for use in application verification. This thesis demonstrates how a high-speed software
emulator can be created for self-controlled dynamically reconfigurable data path machines,
using a static serialisation of the data paths in each configuration context. This yields run-time
performance several orders of magnitude higher than existing techniques, making it suitable for
use in feedback-directed optimisation.
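To make the idea of static serialisation concrete, here is a minimal sketch: the cells of one configuration context form a data flow graph, and a topological sort performed once at load time gives an order in which a serial host can evaluate each cell exactly once per step. The cell names, the graph encoding and the `serialise` function are hypothetical, and the sketch assumes a purely combinatorial context with no connection loops; it is not the serialisation algorithm presented in the thesis.

```python
from collections import deque


def serialise(cells, edges):
    """Return an evaluation order in which each cell follows all of its inputs.

    cells: list of cell names; edges: list of (producer, consumer) pairs.
    Uses Kahn's topological sort and raises if the graph contains a loop.
    """
    consumers = {c: [] for c in cells}
    pending_inputs = {c: 0 for c in cells}
    for src, dst in edges:
        consumers[src].append(dst)
        pending_inputs[dst] += 1

    # Cells with no unresolved inputs can be evaluated immediately.
    ready = deque(c for c in cells if pending_inputs[c] == 0)
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)
        for nxt in consumers[cell]:
            pending_inputs[nxt] -= 1
            if pending_inputs[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(pending_inputs):
        raise ValueError("connection loop: no static order exists")
    return order


# Hypothetical context: two register reads feed an adder, then a multiplier.
cells = ["reg_a", "reg_b", "add0", "mul0", "reg_out"]
edges = [("reg_a", "add0"), ("reg_b", "add0"),
         ("add0", "mul0"), ("reg_b", "mul0"), ("mul0", "reg_out")]
order = serialise(cells, edges)
print(order)
```

In a scheme like this the order is computed once per configuration context, so the emulator's inner loop reduces to walking a flat list of cells.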
Declaration of originality
I hereby declare that the research recorded in this thesis and the thesis itself was composed and
originated entirely by myself in the School of Engineering at The University of Edinburgh.
Mark Muir
Acknowledgements
First, I wish to express my sincere gratitude to my supervisors and colleagues who had the
stamina to read through several drafts of this thesis and provide me with excellent feedback.
Further thanks go to my supervisors: Dr Iain Lindsay and Professor Tughrul Arslan. As primary
supervisor, Iain Lindsay went out of his way to help expand my knowledge through
countless multi-hour discussions throughout the course of my PhD. One must not underestimate
the importance of discussing ideas with someone of great intellect, in order to really test
one’s understanding, and to thoroughly question every assumption and the validity of any claims
and conclusions made. Tughrul Arslan’s wide connections in the academic world allowed me
to keep my work focussed at the cutting edge, and to meet other teams internationally and
exchange knowledge. He also founded the company Spiral Gateway to commercialise the RICA
technology and associated software, which allowed my work to be deployed in a commercial
environment and subject to the level of rigour that entails.
I thank Spiral Gateway for employing me to do much of this work, and for giving me the
opportunity to put my work into real use, by real people. Working in a commercial environment,
and with such a dedicated team (both engineering and management), has given me direct,
first-class exposure to the experience of running a company, creating and marketing a product, and
communicating with customers. I have been fortunate over my time in the company to be able
to see our ideas go from concept to a product nearing final tape out.
My colleagues in the University and in Spiral Gateway have been invaluable in providing a fun
and intellectually fast-paced working environment, and in being a continuous source of ideas
and topics of conversation.




Contents
Declaration of originality ........ iii
Acknowledgements ........ iv
Contents ........ v
List of figures ........ viii
List of tables ........ xii
Abbreviations ........ xiii
Nomenclature ........ xvii
1 Introduction 1
1.1 Publications ........ 4
1.2 Novelty ........ 4
1.3 Structure ........ 6
2 Background 7
2.1 Computing Architectures ........ 8
2.1.1 Application-Specific Integrated Circuits ........ 8
2.1.2 Field Programmable Gate Arrays ........ 8
2.1.3 Microprocessors ........ 9
2.1.4 Multi-Core ........ 10
2.1.5 Application-Specific Instruction Set Processors ........ 10
2.1.6 Coarse-Grained Reconfigurable Architectures ........ 11
2.1.7 Dynamically Reconfigurable Arrays ........ 11
2.1.8 RICA ........ 12
2.2 Programming Methodologies for Reconfigurable Architectures ........ 14
2.2.1 Re-targetable Toolchains ........ 16
2.2.2 Supporting Operation Chaining ........ 16
2.2.3 Working From Assembly ........ 17
2.3 Previous Work ........ 20
2.3.1 The Work of This Thesis ........ 22
2.3.2 Further Work ........ 25
2.4 Summary ........ 26
3 Emulation 27
3.1 Background ........ 29
3.1.1 Background: Emulation ........ 29
3.1.2 Background: Modelling Data Path Parallelism ........ 31
3.1.3 Contribution: Load-Time Serialisation ........ 33
3.2 The Modelled System ........ 35
3.3 Emulator Technology ........ 37
3.3.1 Extensibility ........ 38
3.3.2 Contribution: Serialisation Algorithm ........ 43
3.4 Results ........ 45
3.4.1 Results: Execution Speed For a Range of Standard Benchmarks ........ 45
3.4.2 Results: Effect of Data Path Shape ........ 47
3.5 Summary ........ 49
4 Scheduling 51
4.1 Problem Description ........ 52
4.2 Example ........ 54
4.3 Scheduling Stages Overview ........ 62
4.4 Linking ........ 64
4.4.1 Live Symbol Identification Algorithm ........ 66
4.5 DFG Analysis ........ 67
4.5.1 DFG Analysis Algorithm ........ 70
4.6 CFG Analysis ........ 72
4.6.1 CFG Analysis Algorithm ........ 73
4.7 Live Register Identification ........ 75
4.7.1 Contribution: Live Register Identification Algorithm ........ 77
4.8 Parallelisation ........ 84
4.8.1 Temporary Register Assignment ........ 84
4.9 Scheduling Algorithm ........ 87
4.9.1 Background: List Scheduling ........ 87
4.9.2 Contribution: Tree Follower ........ 92
4.10 Register Starvation Avoidance ........ 95
4.10.1 Rewind ........ 96
4.10.2 Shuffle ........ 98
4.10.3 Basic Block Splitting ........ 100
4.10.4 Serialisation ........ 101
4.11 Resource Configuration ........ 102
4.11.1 Background: RMEM Cascading ........ 104
4.11.2 Contribution: RMEM Cascading Algorithm ........ 107
4.12 Global Register Reallocation Information ........ 109
4.12.1 Contribution: Obtaining The Global Register Reallocation Information ........ 113
4.12.2 Contribution: Using The Global Register Reallocation Information ........ 115
4.13 Results ........ 117
4.13.1 Results: Scheduling Algorithm ........ 117
4.13.2 Results: Live Register Identification ........ 122
4.13.3 Results: Register Starvation Avoidance ........ 125
4.13.4 Results: Global Register Reallocation ........ 130
4.14 Summary ........ 133
5 Pipelining 135
5.1 Background: Structural Pipelining ........ 138
5.1.1 Background: Software Pipelining ........ 139
5.2 Preconditions ........ 140
5.3 Contribution: Dynamic Pipelining ........ 143
5.3.1 Contribution: Pipeline Stage Allocation Algorithm ........ 144
5.4 Contribution: Multi-Step Pipelining ........ 147
5.5 Contribution: Single-Step Pipelining ........ 150
5.5.1 Contribution: Hardware Modifications For Single-Step Pipelining ........ 153
5.5.2 Contribution: Software Modifications For Single-Step Pipelining ........ 155
5.6 Contribution: Automating The Choice of Timing Constraint ........ 158
5.7 Contribution: Support for Internally Pipelined Cells ........ 160
5.7.1 Contribution: Scheduling Internally Pipelined Cells ........ 161
5.7.2 Contribution: Pipelining Kernels With Internally Pipelined Cells ........ 164
5.8 Results ........ 165
5.8.1 Results: Dynamic Pipelining ........ 165
5.8.2 Results: Internally Pipelined Cells ........ 174
5.8.3 Results: Automatic Timing Constraint ........ 179
5.9 Summary ........ 184
6 Conclusions 185
6.1 Emulation ........ 186
6.1.1 Emulation: Problem Description ........ 186
6.1.2 Emulation: Demonstrated Outcomes and Contribution to Knowledge ........ 186
6.1.3 Emulation: Further Work ........ 187
6.2 Scheduling ........ 189
6.2.1 Scheduling: Problem Description ........ 189
6.2.2 Scheduling: Demonstrated Outcomes and Contribution to Knowledge ........ 190
6.2.3 Scheduling: Further Work ........ 192
6.3 Pipelining ........ 193
6.3.1 Pipelining: Problem Description ........ 193
6.3.2 Pipelining: Demonstrated Outcomes and Contribution to Knowledge ........ 193
6.3.3 Pipelining: Further Work ........ 196
6.4 Closing Remarks ........ 197
A Emulator Test Programs 199
B Live Register Identification Algorithm Trace 205




List of figures
2.1 Spectrum of devices, from ASIC through to CPU. ........ 7
2.2 ALU-based homogeneous array vs. heterogeneous array. ........ 12
2.3 Simplified example of a reconfigurable instruction cell array (RICA). ........ 13
2.4 Complete RICA toolchain. ........ 20
2.5 Machine description file (MDF) syntax before the work of this thesis. ........ 21
2.6 Machine description file (MDF) syntax after the work of this thesis. ........ 23
3.1 Modelling a serial machine on another serial machine. ........ 29
3.2 Modelling combinatorial data paths on a serial machine. ........ 31
3.3 Modelling sequences of parallel data paths on a serial machine. ........ 32
3.4 Modelled system: reconfigurable core (simplified), memory, and example peripherals. ........ 35
3.5 Pseudo-code for memory interface. ........ 36
3.6 Simplified add cell class implementation pseudo-code. ........ 38
3.7 Core execution loop pseudo-code. ........ 38
3.8 Pre-processor guarded sections in a typical cell type implementation header file. ........ 40
3.9 Source code extract for auto-generating each cell type factory class, along with a file-static instance of it. ........ 41
3.10 Example auto-generated source code resulting from the pre-processor metaprogramming in figure 3.9. ........ 41
3.11 Source code extract for auto-generating look-up tables associating a human readable name to each configuration word value, for each cell type. ........ 42
3.12 Example auto-generated source code resulting from the pre-processor metaprogramming in figure 3.11. ........ 42
3.13 Example configuration context involving only combinatorial operations, and another including a connection loop. ........ 43
3.14 Cell action execution order for the example step DFG given in figure 3.13(b). ........ 44
3.15 Visual representation of the kernels used in table 3.2. ........ 47
4.1 Toolchain overview: process of converting C source files into a set of configuration contexts. ........ 51
4.2 Example assembly for a basic block containing 4 independent data paths. ........ 54
4.3 Data flow graph extracted from the assembly in figure 4.2. ........ 55
4.4 Assembly instructions from figure 4.2 grouped by which independent data path they belong to. ........ 55
4.5 Data flow graph from figure 4.3 scheduled onto a very simple array. ........ 57
4.6 Example schedule from figure 4.5 showing temporary registers. ........ 58
4.7 The abstract netlist resulting from the schedule shown in figure 4.5. ........ 59
4.8 The first step of figure 4.5 mapped onto the array. ........ 60
4.9 The second step of figure 4.5 mapped onto the array. ........ 61
4.10 The tasks performed by the scheduler: stages to convert from assembly to abstract netlist. ........ 62
4.11 Example RICA assembly with the main function (showing a few of its basic blocks), and some global data symbols. ........ 65
4.12 Example assembly for a basic block. This example contains 4 independent data paths. ........ 67
4.13 Data flow graph (DFG) extracted from the assembly in figure 4.12. ........ 67
4.14 Assembly instructions from figure 4.12 grouped by which independent data path they belong to. ........ 68
4.15 Concept of predecessors and successors in the data flow graph. ........ 69
4.16 Example program control flow graph. ........ 72
4.17 Example basic block assembly and corresponding register lifetimes. ........ 75
4.18 Example program control flow graph, used for live register identification. ........ 77
4.19 Significant features of the example CFG from figure 4.18. ........ 78
4.20 Example CFG from figure 4.18 showing which input and output registers are found to be live or dead. ........ 80
4.21 Example demonstrating a problem with identifying live registers in nested loops. ........ 82
4.22 Example basic block showing the assembly, data paths extracted from it, and resulting schedule. ........ 85
4.23 Flow chart of generic list scheduling algorithm. ........ 88
4.24 Illustration of dependent and independent operations. ........ 88
4.25 ASAP and ALAP mobility based list scheduling techniques applied to an arbitrary data flow graph. ........ 90
4.26 ASAP and ALAP mobility based list scheduling techniques applied to another arbitrary data flow graph. ........ 91
4.27 Tree follower scheduling algorithm flow chart. ........ 92
4.28 Tree follower scheduling algorithm applied to the same arbitrary data flow graphs. ........ 93
4.29 Step data model produced by the scheduling algorithm for the example in figure 4.22. ........ 94
4.30 Illustration of the ‘rewind’ register starvation avoidance method. ........ 97
4.31 Illustration of the ‘shuffle’ register starvation avoidance method. ........ 99
4.32 Illustration of the ‘split’ register starvation avoidance method. ........ 100
4.33 Example from figure 4.22 converted to steps. ........ 103
4.34 Step data flow graphs showing memory read operation cascading. ........ 104
4.35 Timing diagram for the step DFG shown in figure 4.34. ........ 105
4.36 Timing diagram for the step DFG shown in figure 4.34, with internally pipelined memory access cells. ........ 106
4.37 Analysis of RMEM operations. ........ 107
4.38 Information flow diagram for a simple program. ........ 110
4.39 Information flow diagrams for different register reassignment methodologies. ........ 111
4.40 Influence of control flow on which step boundaries a given register represents the same piece of information. ........ 114
4.41 Step count resulting from multiple runs of the scheduling algorithm on the DCT kernel, against availability of certain key resources. ........ 118
4.42 Total critical path of the DCT kernel, against availability of certain key resources. ........ 119
4.43 Throughput of the DCT kernel, against availability of certain key resources. ........ 119
4.44 Achieved overlap, against availability of certain key resources. ........ 120
4.45 Step data flow graphs for the DCT main kernel, when the MUL resource is constrained to various degrees. ........ 121
4.46 Registers available fo r tem porary values in the D C T exam ple, w ith and w ithout
live register identification ............................................................................................................. 123
4.47 Registers available betw een basic b locks in the D C T exam ple, w ith and w ithout
live register identification ......................................................................................................... 123
4.48 Registers available for tem porary values in the gam m a correction exam pm le,
w ith and w ithout live register identification...........................................................................124
4.49 Registers available betw een basic blocks in the gam m a correction exam ple,
w ith and w ithout live register identification.......................................................................124
4.50 H istogram  o f register starvation avoidance techniques fo r the gam m a correction 
exam ple main loop, w ith live register identification enabled ....................................... 127
4.51 H istogram  o f register starvation avoidance techniques for the gam m a correction 
exam ple m ain loop, w ithout live register identification .....................................................128
4.52 N um ber o f steps resulting from  the scheduling o f the gam m a correction m od­
u le ’s m ain loop basic block, over a range o f  register instance coun ts...........................129
4.53 C hange in total critical path o f  the resource constrained gam m a correction m od­
ule, over a range o f  register instance coun ts...................................................................... 129
4.54 C hange in throughput o f  the resource constrained gam m a correction  m odule, 
over a range o f register instance coun ts...................................................................................130
4.55 L I 048 path length h istogram ..................................................................................................131
4.56 L 9 17 path length h istogram ....................................................................................................131
5.1 Typical program on a reconfigurable processor, with pipelined kernel
5.2 Illustration of software pipelining
5.3 Example kernel data flow graph before and after pipelining
5.4 Fill, loop and flush step sequence created for an example kernel
5.5 Control flow between the steps of a 3-stage pipelined kernel
5.6 Expanded control flow for figure 5.5 for a number of iterations
5.7 Internal control signals during execution of a non-pipelined kernel
5.8 Internal control signals during execution of a pipelined kernel
5.9 Construct used to preserve the final values of kernel registers when single-step pipelining
5.10 Construct for supplying the initial value to kernel registers when single-step pipelining
5.11 Idle time resulting from the master clock
5.12 A divider cell internally pipelined to 4 stages
5.13 Instruction slots for a cell which supports an internally pipelined instruction
5.14 Edges representing a data path containing either a combinatorial or internally pipelined divider cell
5.15 Measured throughput of the pipelined demosaic 3x3 kernel, for a range of target critical path length constraints
5.16 Pipeline stages and additional registers for each pipeline geometry generated for the demosaic 3x3 kernel
5.17 Improvement in throughput vs. pipeline depth for the demosaic 3x3 kernel
5.18 Measured throughput of the pipelined DCT kernel, for a range of target critical path constraints
5.19 Pipeline stages and additional registers for each pipeline geometry generated for the DCT example
5.20 Measured throughput of the gamma correction kernel before and after pipelining
5.21 Pipeline stages and additional registers for each pipeline geometry generated for the gamma correction kernel
5.22 Improvement in throughput vs. pipeline depth for the gamma correction kernel
5.23 Throughput before and after automatic pipelining for the Hamilton demosaic and iterative software division examples
5.24 Pipeline geometries generated for the Hamilton demosaic and iterative software division examples
5.25 Correlation between the throughput and pipeline geometry graphs, showing pipeline depth relaxation
A.1 C source code for the example with four copies of the data path executing independently, in parallel
A.2 Step data flow graph for the ‘parallel’ example program’s main loop
A.3 C source code for the example with two copies of the data path executing independently in parallel, with another two copies of the data path dependent on these (thus extending the critical path)
A.4 Step data flow graph for the ‘combinatorial’ example program’s main loop
A.5 C source code for the example with the data path executed inside a loop (which hasn’t been unrolled), causing the main loop to consist of four iterations of the same configuration context executing in sequence
A.6 Step data flow graph for the ‘sequential’ example program’s main loop
C.1 Data flow graph for the 3x3 demosaic main loop, without pipelining
C.2 Data flow graph for the 3x3 demosaic main loop, with single-step pipelining
C.3 Data flow graph for the DCT main loop, without pipelining
C.4 Data flow graph for the DCT main loop, with single-step pipelining
C.5 Data flow graph for the gamma correction main loop using combinatorial memory reads, without pipelining
C.6 Data flow graph for the gamma correction main loop using combinatorial memory reads, with single-step pipelining
C.7 Data flow graphs for the steps corresponding to the gamma correction main loop using internally pipelined memory reads, without pipelining
C.8 Data flow graph for the step corresponding to the gamma correction main loop using internally pipelined memory reads, with single-step pipelining
List of tables
3.1 Execution speed for various standard benchmarks, normalised to the speed of the emulator
3.2 Complexity and relative execution speed (emulator vs. SystemC model) for some simple test programs
4.1 Available instruction cell resource count for a hypothetical, artificially small RICA array
4.2 All edges from the example data flow graph in figure 4.3, and the corresponding assembly in figure 4.2
4.3 All edges from the example data flow graph in figure 4.13, and the corresponding instruction or register
4.4 Register information for the basic blocks of the example in figure 4.18
4.5 Final record of registers live on exit from each basic block in figure 4.18
4.6 DCT kernel resource requirements, in terms of instruction cells on the target architecture
4.7 Simplified gamma correction filter kernel resource requirements
4.8 Post-routing statistics for the two most complex kernels in a 3rd party ISP
5.1 Demosaic 3x3 filter kernel resource requirements
5.2 Throughput of the demosaic 3x3 filter kernel before and after multi-step pipelining
5.3 Throughput of the demosaic 3x3 filter kernel before and after single-step pipelining
5.4 DCT kernel resource requirements
5.5 Performance of the DCT filter for various multi-step pipeline geometries
5.6 Performance of a DCT filter for various single-step pipeline geometries
5.7 Gamma correction filter kernel resource requirements
5.8 Performance of the gamma correction filter kernel before and after pipelining, using combinatorial memory operations
5.9 Performance of the gamma correction filter kernel before and after pipelining, using internally pipelined memory operations
5.10 Performance of the Hamilton demosaic filter before and after automatic pipelining, for a range of master clock periods
B.1 Trace of the CFG walk for the live register identification example
B.2 Continuation of the CFG walk trace in table B.1
Abbreviations
ABI Application binary interface. A convention followed by a compiler to ensure interoperability with other programs on a particular platform. Defines which registers are reserved for special purposes, how arguments are passed to functions, calling conventions, the format of the stack frame, etc.
ADDCOMP A RICA instruction mnemonic representing an addition or comparison operation. These require similar hardware, so were later combined into the same cell type.
ALAP As late as possible.
ALU Arithmetic logic unit.
API Application programming interface. A set of functions providing a high-level interface to certain common or low-level functionality.
ARM Advanced RISC machine. A ubiquitous embedded microprocessor architecture.
ASAP As soon as possible.
ASIC Application-specific integrated circuit. Custom silicon created for a particular task.
ASIP Application-specific instruction set processor. A type of microprocessor where application-specific functionality has been provided through additional instructions. Such instructions are typically very complex, performing high-level functionality.
CFG Control flow graph.
CGRA Coarse-grained reconfigurable array. An umbrella term for particular types of dynamically reconfigurable architectures which operate on the word level (rather than the bit level).
CPU Central processing unit. This term can refer to any type of processor that can perform complex control flow.
DCT Discrete cosine transform. A common phase-agnostic mathematical transform used in image and audio compression.
DFG Data flow graph.
DMA Direct memory access. Hardware that allows memory operations such as block transfers to be performed in the background, without continuous intervention from the CPU.
DRA Dynamically reconfigurable array. An umbrella term for data path architectures that are intended to be reconfigured many times during normal operation.
DSP Digital signal processor. A type of embedded processor with an instruction set optimised for performing common signal processing tasks.
FIR Finite impulse response filter. A type of digital filter with no feedback.
FPGA Field programmable gate array. A ubiquitous data path reconfigurable architecture mostly used in system-on-chip prototyping.
FPOA Field programmable object array. A type of coarse-grained heterogeneous data path reconfigurable architecture.
FU Functional unit. The general term for a hardware block that performs the native operations of a particular architecture. These can be individual gates, ALUs, or even microprocessors.
GALS Globally asynchronous, locally synchronous. A design pattern used in highly multi-core architectures, to make it conceptually easier to pass information between the cores.
GCC The GNU compiler collection (formerly the GNU C compiler). An open-source re-targetable compiler framework.
GIMPLE GNU variant of SIMPLE: an internal representation used in the GCC compiler, based on SSA.
GPL GNU General Public License. A license under which open-source software can be released, ensuring the right of ‘copy-left’.
GPU Graphics processing unit. Custom silicon which performs highly parallel, high throughput operations used in 3-D graphics. Current GPUs are based on arrays of SIMD processors.
HDL Hardware description language. A computer-parseable language used to express the design of hardware in a scalable, modular fashion.
HPC High-performance computing. A field of computing where problems are solved using clusters of computer nodes which run multiple copies of the same program (each operating on different parts of the problem), connected via a high bandwidth network/interconnect.
ID Short for ‘identity’.
IIR Infinite impulse response filter. A type of digital filter where the output is a function of the output delayed by some number of iterations (i.e. feedback).
ILP Instruction-level parallelism. Where multiple instructions in sequence can be executed in parallel. Cf. thread-level parallelism, where parallelism is achieved by having multiple independent instruction streams executing concurrently.
IP Intellectual property.
ISP Image signal processor/processing. A series of algorithms that manipulate digital images, to compensate for artefacts in the sensor and optics. Can also refer to an ASIC implementing this functionality.
JUMP A RICA instruction mnemonic allowing the program control flow to be affected by modifying the value of the program counter.
LHS Left-hand side (of an equation or relationship).
LLVM The low-level virtual machine. A powerful new compiler framework / optimising linker which operates around an intermediate representation which describes functionality in terms of the instruction set of a highly generic virtual machine.
MAC Multiply accumulate. A common DSP operation.
MDF Machine description file. A file format used to describe a RICA core, in terms of cell types present, instance counts, locations in the array, timing information, and other properties.
MOV A RICA instruction mnemonic representing a ‘move’: the transfer of data from one register to another. This concept has no direct physical counterpart in the real hardware, but is used to represent fan-out.
MUL A RICA instruction mnemonic representing a multiplication operation.
NRE Non-recurring expenses. Design cost.
NUMA Non-uniform memory architecture. A design pattern in computer architectures where different types of memory are present, in separate address spaces. This improves memory bandwidth, but makes programming more difficult.
OEM Original equipment manufacturer.
OS Operating system. Low-level software running on a particular platform.
PC Program counter. A register that controls which instruction/configuration context is to be executed.
PDC Pipeline depth counter. A hardware concept introduced in this thesis, for hardware-assisted (single-step) pipelining on RICA.
RAM Random-access memory.
RGB Red/green/blue pixel format.
RHS Right-hand side (of an equation or relationship).
RICA Reconfigurable instruction cell array. The dynamically reconfigurable architecture targeted by this thesis.
RISC Reduced instruction set computer. Also known as regular (uniform) instruction set computer, or load/store architecture.
RMEM Read memory. A RICA instruction mnemonic representing combinatorial reads from data memory.
RRC Reconfiguration rate controller. A type of instruction cell in RICA which controls the program counter, affecting control flow.
RTL Register transfer level.
SIMD Single instruction, multiple data. A type of computer architecture where multiple ALUs perform the same operation on several data sets at once, following a common control flow.
SOC System-on-chip. A form of ASIC where an entire computer (CPU, memory and peripherals) is integrated into a single die or package.
SRAM Static random-access memory.
SRBUF A RICA instruction mnemonic representing reading from a stream buffer.
SSA Static single assignment. A representation of data flow graphs inside a compiler, that makes it easier to analyse.
TLM Transaction-level model. A form of hardware model available in SystemC.
ULIW Ultra long instruction word. A type of VLIW processor.
VLIW Very long instruction word. A type of processor with multiple ALUs.




Nomenclature
Abstract netlist A netlist describing only which connections exist, but not how those connections are mapped onto the interconnect. This means the timing information is not accurate.
Active register A register is active in a basic block if it is read from or written to in the assembly instructions belonging to that basic block. After the basic block has been parallelised by a scheduler, an active register may not actually need to be written to if the data that it stores is only used inside that basic block.
Assembly Assembly language (also loosely known as assembler). A low-level but human-readable language which directly corresponds to instructions in the target architecture’s instruction set. Assembly provides a thin layer of abstraction above machine language, where the instructions and operands would be coded directly in binary.
Basic block A group of assembly instructions that are always executed in sequence. Basic blocks begin with a unique label, which is used to identify them. Basic blocks either end by simply passing control to the next basic block in sequence, or they end with a jump / branch instruction, which optionally passes control to another named basic block. Basic blocks are the smallest indivisible unit of control flow.
Bitstream Raw binary data that will become the contents of a reconfigurable architecture’s program memory, in order for it to execute a given program. This is generated by a tool, from the output of a toolchain, beginning with a human-readable language of some sort.
Coarse-grained Refers to the native width of the interconnect of a reconfigurable architecture. Coarse-grained means that the native width is more than 1-bit, therefore functional units will perform word-level operations.
Compilation unit A C source file plus any other files that it #includes, and any that they #include, etc. This is the scope in which a compiler typically operates.
Compiler A software tool that converts source code from a high-level programming language (e.g. C) into a lower-level language to be interpreted by a machine. Typically this is assembly, where the instructions correspond to the instruction set of a particular microprocessor architecture, obeying the rules of an ABI.
Configuration context The data completely describing the configuration of a reconfigurable architecture’s functional units and interconnect at a given moment in time, in order for it to form a specific set of data paths. Reconfiguration consists of loading a new configuration context. Also referred to as a wide instruction in some literature.
Control flow graph A graph describing how program control can pass between the basic blocks of a program, during execution. The nodes are basic blocks, and the edges are possible directions of control flow.
Data flow graph A graph describing how the operations in a basic block are connected together into data paths. The nodes are operations (mapping to functional units), and the edges indicate data dependencies between the operations.
Data path A chain of connected operations, where the result of one operation is used as an input to one or more dependent operations in the chain. Operations belong to the same data path if there is at least one unbroken path connecting them (involving any number of other operations in between).
Dead register Registers used in the instructions of a basic block to pass information to other instructions of the same basic block, where the lifetime of this information lies entirely within that basic block.
Dependent operations Two operations are said to be dependent if one cannot begin before the other has produced a result. Dependent operations cannot be executed in parallel, but can be executed combinatorially (if the architecture supports this).
Dormant register A register that is not read from or written to in a basic block or step, but which stores information that is used later in the program.
Emulator A software tool that simulates a specific microprocessor and system, where the behaviour is modelled at a high level, generally leading to faster execution.
Fine-grained Refers to the native width of the interconnect of a reconfigurable architecture. Fine-grained means that the native width is 1-bit, therefore functional units will perform bit-level operations.
Functional unit The elements of a hardware array which operate on data. These can be logic gates, ALUs, or even processor cores.
Immediate A small integer constant that can be coded directly into spare bits in an instruction. Immediates are commonly supported in RISC instruction sets.
Independent operations Two operations are said to be independent if neither depends on the result of the other. Independent operations can safely be executed in parallel.
Input register A register that brings information into a basic block, from earlier in the program. This occurs when the operand of an instruction is a register that has not yet been written to in that basic block.
Instruction A mnemonic used in assembly to associate a human-readable name with a bit pattern corresponding to a particular operation in a microprocessor’s instruction set. An instruction consists of a name, followed by operands. In a RISC processor, the operands are always registers or immediate constants.
Instruction cell The name given to the functional units of a coarse-grained reconfigurable processor, where the functional units correspond in functionality to instructions common in RISC instruction sets.
Kernel An inner loop in a compute-intensive application, which normally runs for many consecutive iterations. In RICA, this term is also used to refer to a subset of these where the loop body can fit entirely within a single configuration context. This is the most efficient way to execute, as the configuration context only has to be loaded once during the run time of the loop.
Line buffer Another name for a stream memory, used in the context of image signal processing, where local storage is normally for lines of the image near the line currently being processed.
Line memory Another name for line buffer.
Live register A register that is read from or written to in the instructions of a basic block, where that information is needed later in the program (outside of that basic block).
Mapper A software tool that determines how paths should be rendered onto the reconfigurable interconnect of a particular architecture, in order to achieve the connections described in an abstract netlist, without conflicts. The output is a routed netlist.
Netlist A file describing the connectivity between functional units in a reconfigurable architecture. The file describes the graph for each configuration context, where the nodes are the functional units (cells), and the edges are the connections. Edges can contain properties that describe the path taken along the interconnect.
Operation A logical operation to transform data in some way. Operations typically have two inputs, and produce one result. Operations are represented by instructions in assembly.
Operation chaining The capability of a given hardware architecture to execute dependent operations in the same clock cycle / iteration. This relies on the ability to execute these operations combinatorially: a physical wire brings the result of one into the input of the other. This is impossible on most architectures, since results normally have to be written to registers, and read back in another cycle / iteration.
Output register A register that is written to in the instructions of a basic block, where the value is not subsequently overwritten (clobbered) by another instruction in the same basic block. The stored information may or may not be needed in later basic blocks. Output registers can therefore be live or dead, respectively.
Pipeline register A register not originally appearing in the instructions of a basic block, which a scheduler infers during pipelining of a basic block. Pipeline registers delay data between pipeline stages, and are inserted along any connection that spans pipeline stages.
Profile A file containing timing information, execution counts, etc. derived from executing a given program on the target architecture (or in a simulation of it).
Reconfigurable A term given to computing architectures that are not hardwired to perform a single function, i.e. they can change the shape of their data paths in order to change the functionality of the device. This can be done electronically, in-field. The hardware consists of functional units and interconnect (called a fabric), on top of which data paths are rendered.
Routed netlist A netlist augmented with path information, showing how each of the connections is physically realised on the reconfigurable interconnect of the target architecture.
Scheduler A software tool that converts the basic blocks of a linear assembly into parallel data paths that are to be rendered onto a reconfigurable architecture. In general, a scheduler extracts parallelism from a sequential stream (of operations).
Simulator A software tool that simulates a specific piece of hardware, where the behaviour is modelled at a relatively low level, e.g. register transfer level in an HDL simulator, or the communication between individual functional units of a reconfigurable architecture simulator.
Step Another name for configuration context, used specifically for RICA.
Stream memory Local on-chip random-access memory used as local storage in high-bandwidth streaming applications. Stream memory is normally partitioned into multiple banks to increase the bandwidth.
Target architecture The particular computing architecture that a program is intended to execute on.
Temporary register A register inferred during scheduling/parallelisation to store the value of an edge in the data flow graph that has been split across the boundary between steps. The




Introduction
The choice of platform for many modern digital signal processing tasks in embedded systems
is often limited to application-specific integrated circuits (ASICs), since off-the-shelf
programmable architectures such as DSPs and microprocessors cannot meet the throughput
requirements, whereas reconfigurable hardware such as field-programmable gate arrays (FPGAs)
requires too much area and power.
For applications that demand an element of reprogrammability, streaming processors (such as
those offered by Ambric [1] and SPI [2]) are becoming an increasingly attractive solution.
They improve on throughput by providing multiple processing elements/cores with an
interconnect structure suited to streaming. However, these processing elements (usually based
on regular DSP designs) often equate to significant silicon area.
Alternatively, coarse-grained dynamically reconfigurable architectures (DRAs) offer a high
degree of parallelism, sufficient to achieve high throughputs [3][4]. Thus fewer cores are
required for a given application, leading to a much lower area overhead. Coarse-grained DRAs,
such as instruction cell based processors [5][6], provide a high degree of instruction
chaining inside the core, by allowing arbitrary connections to be made between the various
functional units via a configurable routing network. This allows quite complex data paths to
be rendered onto the fabric and executed in a single configuration, which makes these
architectures particularly suitable for stream processing, as fewer fetches from program
memory are required.
The classes of computing architectures and the languages used to program them are covered
in chapter 2. The main observation is that high throughput and area efficiency are common
properties of data path machines (due to parallelism), whereas ease of programmability and
flexibility are common properties of microprocessors and their derivatives (due to arbitrary
control flow). To get the best of both worlds, reconfigurable computing typically involves
coupling a data path machine with a microprocessor. Most approaches differ mainly in the
degree of coupling between the two.
Data path machines are programmed using tools from the ASIC design world, since this is
what they most resemble; whereas microprocessors are programmed from high-level languages.
Therefore, the co-processor approach common in reconfigurable computing generally involves
having to partition an application such that the data path intensive parts are implemented on
the data path machine, leaving the control flow intensive parts to be implemented on the
microprocessor.
Recent innovations in the design of data path machines have allowed them to be in control of
their own reconfiguration. Furthermore, they can be reconfigured very rapidly (e.g. millions
of times per second), which makes them able to achieve control flow similar to a regular
microprocessor. Complete applications can therefore be mapped to a single architecture/device,
using a single language and code base. This significantly reduces design time and improves
maintainability.
The reconfigurable instruction cell array (RICA [5]) represents a family of dynamically
reconfigurable devices: the technology is scalable, from array sizes of tens of cells to
arrays of thousands of cells (or more). There is a trade-off involved: very high-throughput
tasks are only possible on large arrays, where there is sufficient parallelism available.
However, large arrays have larger configuration sizes. This limits the rate at which they can
be reconfigured and the number of configurations that can be stored, which in turn affects the
ability to execute arbitrary (general purpose) code, such as for control. Conversely, if the
device is to be used more like a general purpose processor but with the occasional need to
perform parallel operations of moderate complexity (compared to VLIWs), then its ability to
execute large data paths is limited.
Since both scenarios have significant commercial applicability, it is important to be able to
address both, and any combination in between. Therefore, the C to RICA mapping tools and
simulation tools are required to be generic enough to accommodate the complete range of
devices, and beyond (for research).
This thesis proposes and demonstrates algorithms that can be used to create tools that attempt
to do this. Two areas of the tool chain are considered in this work:
• A scheduler for extracting parallelism from serial code (resulting from a traditional
software compiler), and making sure that hardware constraints are met.
• A high-speed software emulator for the target architecture, designed to be easily extended
with new instruction cells and hardware functionality. This is the focus of chapter 3.
Scheduling is further split into the following components:
•  Register and resource constrained scheduling of serial code onto parallel architectures
supporting operation chaining (chapter 4). This is how time-division multiplexing is
achieved.
•  Data path pipelining: exploiting existing hardware techniques to automatically convert
a static combinatorial data path into a pipeline, then making use of rapid reconfiguration
to apply software pipelining techniques to this pipeline. A simple enhancement to the
hardware is also proposed, to make this more efficient for larger cores (chapter 5).
Pipelining is the main contribution of this work. It allows for dramatic increases in throughput,
sufficient to allow the target architecture to compete with hard-wired ASIC implementations.
The scheduling work is needed to create the data sets that the pipelining algorithm works on.
Emulation is a key component in providing feedback-directed optimisation, for use in
improving the scheduling and code layout.
To expedite the development of applications targeting these coarse-grained DRAs, it is
desirable to program them from the same languages that the developers use to prototype their
algorithms. This is most often high-level languages such as C. The configuration contexts of
such an architecture, if considered to be analogous to instructions in a conventional
microprocessor, make it possible to write a back-end to a conventional compiler (such as GCC) to target
these architectures. The types of instruction cell resources in the array are (by design) similar
to the instructions available in a conventional RISC (load/store) instruction set. It is these that
are represented by instructions in the compiler. Despite being able to reconfigure very rapidly,
each configuration context must do a lot more work than a single instruction, as the switching
time is still several orders of magnitude slower than that achievable with a conventional
microprocessor. Therefore, multiple instructions produced by the compiler must be mapped to each
configuration context. This process is called scheduling, and is the focus of chapter 4.
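To make the idea of mapping several compiler instructions onto one configuration context concrete, the following is a minimal sketch of a first-fit packer. It is not the scheduling algorithm of chapter 4: the function name, the instruction encoding, and the cell counts are all hypothetical, and it assumes that dependent operations may be chained inside one context, so that preserving program order is enough to respect data dependencies.

```python
from collections import Counter

def pack_contexts(instructions, cells_available):
    """Greedily pack serial instructions into configuration contexts.

    instructions: list of (name, cell_type) in program order (hypothetical encoding).
    cells_available: dict mapping cell_type -> number of such cells in the array.
    Returns a list of contexts, each a list of instruction names.
    """
    contexts, current, used = [], [], Counter()
    for name, cell_type in instructions:
        if cells_available.get(cell_type, 0) == 0:
            raise ValueError(f"array has no {cell_type} cell")
        # Start a new context when every cell of this type is already in use.
        if used[cell_type] >= cells_available[cell_type]:
            contexts.append(current)
            current, used = [], Counter()
        current.append(name)
        used[cell_type] += 1
    if current:
        contexts.append(current)
    return contexts

program = [("i0", "ADD"), ("i1", "MUL"), ("i2", "ADD"), ("i3", "ADD"), ("i4", "MUL")]
array = {"ADD": 2, "MUL": 1}   # hypothetical array with two adders and one multiplier
contexts = pack_contexts(program, array)
print(contexts)
```

With the hypothetical array above, the third addition does not fit alongside the first two, so it opens a second context. A real scheduler would also have to account for registers and routing, which this sketch ignores.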
Performance is optimised by attempting to match the size of each kernel (inner loops where
most of the execution time is spent) to the available resources, allowing them to fit into a
single configuration context. This allows the configuration to persist for many clock cycles,
operating on new data on each cycle. This increases throughput, since no time is spent having
to reconfigure the core between successive iterations. It also decreases power consumption, as
the configuration only needs to be fetched from program memory (or cache) once, upon first
entering the kernel, rather than on every iteration. However, the resulting data paths can often
have a long critical path, leading to poor temporal utilisation of the functional units, since they
have to wait until all functional units have completed before operating on the next batch of data,
which limits the throughput.
Pipelining provides a way of starting to operate on a new batch of data before an old one has
completed. Thus, this allows the functional units of multiple stages of the kernel to be active
concurrently, each operating on a different batch of data. Others have devised loop pipelining
techniques for reconfigurable architectures [7, 8, 9], where successive iterations of the loop
are replicated in hardware, and offset from each other to deal with any data dependencies
between the iterations. These are most suitable for large reconfigurable architectures with much
longer reconfiguration times, where there are sufficient resources for the entire loop body to be
replicated many times. The technique proposed here allows complete kernels that were mapped to a single
configuration context to have their critical path length decreased by the addition of pipeline
stage registers. Chapter 5 presents two approaches to pipelining dynamically-reconfigurable
arrays. In the first, pipeline filling and flushing are achieved through dynamic reconfiguration.
This is an entirely software approach. In the second, changes are made to the hardware, to
allow filling and flushing to be incorporated into the single kernel configuration context, thus
significantly reducing the program memory overhead (especially for very deep pipelines).
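The effect of inserting pipeline stage registers can be illustrated with a small sketch. It assumes a purely linear chain of operations and greedily closes a stage whenever the next operation would exceed a target critical path. This is a simplification for illustration only, not the stage allocation algorithm of chapter 5; the function name and the delay values are hypothetical.

```python
def allocate_stages(delays_ns, target_ns):
    """Split a linear chain of operations into pipeline stages.

    delays_ns: combinatorial delay of each operation, in data-flow order.
    target_ns: target critical path for every stage.
    Returns a list of stages, each a list of operation indices; a pipeline
    register is implied on every boundary between consecutive stages.
    """
    stages, current, elapsed = [], [], 0.0
    for index, delay in enumerate(delays_ns):
        if delay > target_ns:
            raise ValueError(f"operation {index} alone exceeds the target")
        # Close the stage if adding this operation would break the constraint.
        if current and elapsed + delay > target_ns:
            stages.append(current)
            current, elapsed = [], 0.0
        current.append(index)
        elapsed += delay
    if current:
        stages.append(current)
    return stages

delays = [2.0, 3.0, 1.0, 4.0, 2.0]   # hypothetical cell delays in nanoseconds
stages = allocate_stages(delays, target_ns=5.0)
unpipelined_path = sum(delays)
pipelined_path = max(sum(delays[i] for i in stage) for stage in stages)
print(stages, unpipelined_path, pipelined_path)
```

In this example the critical path drops from the sum of all delays to the delay of the slowest stage, so a new batch of data can be accepted at a correspondingly higher rate once the pipeline is full. Register setup overheads and the cost of filling and flushing are deliberately left out.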
Pipelining was demonstrated to achieve significant improvements in throughput: up to an order
of magnitude in the examples presented in this thesis. The achievable speed-up scales with the
size of the data paths and the size of the core. For use in image signal processing, throughputs of
180-380 MPixels/s were demonstrated, which is competitive with hard-wired ASIC solutions.
The emulator presented in this thesis is several orders of magnitude faster than competing
software-based solutions, and for small cores, approaches real-time performance. This makes




The work of this thesis is backed by the following publications:
•  “Automated Dynamic Throughput-constrained Structural-level Pipelining in Streaming
Applications” [10]: Presents an algorithm and methodology for the automatic pipelining
of configuration contexts on dynamically reconfigurable data path machines. The method
utilises dynamic reconfiguration to perform pipeline filling and flushing, to avoid changes
to the hardware. The user specifies a timing constraint (i.e. target critical path), and the
algorithm constructs as many pipeline stages as needed in order to meet it.
•  “Automatic dynamic structural-level pipelining in reconfigurable processors” [11]:
Improves on [10] by automating the choice of pipeline critical path constraint, allowing for
pipelining to be completely automated.
•  “Extensible software emulator for reconfigurable instruction cell based processors” [12]:
Presents a novel serialisation algorithm out of which a data path simulator can be made,
and demonstrates its application to reconfigurable computing.
These publications can be found in appendix D.
As of the time of writing, there have been no publications on the scheduling work (chapter 4).
This was due to commercial sensitivity of the algorithms employed therein. However, the work
provides a foundation for the published work on pipelining. Future publications are planned on
some of the novelties described in that chapter.
1.2 Novelty
This thesis presents the following contributions to knowledge, grouped by purpose, the most
significant first.
1.2.0.1 Improving throughput of computationally intensive inner loops by pipelining
Dynamic pipelining: (section 5.3) Proposes the idea of rendering custom pipelines onto
looping configuration contexts to reduce the effective critical path, thus increasing
performance. Dynamic reconfiguration is used to perform pipeline filling and flushing.
Pipeline stage allocation algorithm: (section 5.3.1) Describes the algorithm used to
determine where to insert pipeline stage registers into data paths on a coarse-grained
reconfigurable architecture, in order to meet a given target critical path constraint.
Single-step pipelining: (section 5.5) Proposes hardware modifications that allow filling and
flushing to be performed via the same configuration context as the pipelined loop context,
to reduce the memory footprint and increase the number of pipeline stages that can be




Automating the choice of timing constraint: (section 5.6) Shows how the timing constraint
can be derived automatically, such that the maximum possible performance can be achieved
whilst minimising the resources used.
Support for internally pipelined cells: (section 5.7) Describes modifications to the
pipelining algorithm that allow a pipeline to be constructed involving cells that are internally
pipelined.¹ This is useful for hiding memory latency, and decreases the pipelined critical
path, leading to higher achievable throughputs.
1.2.0.2 Programming dynamically reconfigurable architectures using a single code base
from a high-level language
Live register identification: (section 4.7.1) An algorithm for determining which registers
contain live information at each stage during execution of a program. This is inferred directly
from the assembly instructions produced by a compiler. This information can be used to
improve the parallelism, by making more registers available for storing temporary
values over the boundaries between configuration contexts. It also frees registers for use in
pipelining.
Global register reallocation: (section 4.12) Extends the live register identification
algorithm to track the flow of unique pieces of information through registers. Demonstrates
how this can be used to improve routability, by determining when it is safe to move
information into different registers that are closer to where the information is used,² without
affecting the behaviour of the program.
Register starvation avoidance: (section 4.10) The process of packing data paths together into
configuration contexts typically involves using more registers than the compiler
originally referenced in the assembly. If insufficient registers are available, scheduling fails.
A series of methods is described to work around this, by gradually reducing the
parallelism until a valid schedule can be formed.
Scheduling algorithm and associated data model: (section 4.9.2) A scheduling algorithm for
packing the data paths of basic blocks generated by a compiler into a sequence of
configuration contexts on a reconfigurable architecture. This involves replacing the registers
chosen by the compiler with wires where possible, or otherwise renaming registers in
order to pack the data paths together in parallel.
Support for memory cascading: (section 4.11.2) Proposes an algorithm for analysing the
dependencies between memory access operations within a configuration context, allowing
dependent combinatorial memory accesses to be cascaded together. This avoids
having to split certain inner loops into multiple configuration contexts, resulting in higher
performance.
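As an illustration of what identifying live registers from assembly involves, the sketch below applies the textbook backward liveness scan to a single straight-line block. It is not the algorithm of section 4.7.1, which is not reproduced here; the instruction encoding, register names, and function name are hypothetical, and control flow between blocks is ignored.

```python
def live_registers(instructions, live_out):
    """Backward liveness scan over straight-line assembly.

    instructions: list of (dest, sources) tuples in program order; dest may be None.
    live_out: registers still needed after the last instruction.
    Returns the set of live registers before each instruction.
    """
    live = set(live_out)
    live_before = [None] * len(instructions)
    for index in range(len(instructions) - 1, -1, -1):
        dest, sources = instructions[index]
        live.discard(dest)      # the old value of dest is overwritten here
        live.update(sources)    # operands must hold valid data on entry
        live_before[index] = set(live)
    return live_before

asm = [
    ("r1", ["r0"]),        # r1 = f(r0)
    ("r2", ["r1", "r0"]),  # r2 = g(r1, r0)
    ("r0", ["r2"]),        # r0 = h(r2)
]
result = live_registers(asm, live_out={"r0"})
print(result)
```

Any register absent from the live set at a given point holds no useful information there, so a scheduler is free to reuse it for temporaries, which is the property the contributions above rely on.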
¹ i.e. the result of an operation appears in a different iteration to when the corresponding input values were
sampled.
² reducing the amount of interconnect needed.
1.2.0.3 High-speed simulation of dynamically reconfigurable architectures
Load-time serialisation: (section 3.1.3) Simulating dynamically reconfigurable architectures
that reconfigure many millions of times per second using conventional RTL simulation
is inefficient due to the overhead of serialising the data paths at run-time. The key
contribution here is to perform this serialisation before executing the program, so that each
configuration context is only processed once. This results in a significant performance
gain.
Serialisation algorithm: (section 3.3.2) The algorithm used to serialise the data paths of a
coarse-grained architecture, and associated data model for storing this serialisation for
later execution by an interpreter.
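The idea of serialising once at load time and then interpreting the result many times can be sketched as follows. This is only an illustration under stated assumptions: it uses a generic topological sort from the Python standard library rather than the serialisation algorithm of section 3.3.2, and the context format, cell names, and function names are hypothetical.

```python
from graphlib import TopologicalSorter
import operator

def serialise(context):
    """Order the cells of one configuration context so inputs precede users.

    context: dict mapping cell name -> (function, [input names]); names that
    are not cells are treated as registers read at the start of the step.
    """
    graph = {cell: [src for src in inputs if src in context]
             for cell, (_, inputs) in context.items()}
    order = list(TopologicalSorter(graph).static_order())
    return [(cell, *context[cell]) for cell in order]

def execute(serial_ops, registers):
    """Interpret a pre-serialised context against the current register values."""
    values = dict(registers)
    for cell, function, inputs in serial_ops:
        values[cell] = function(*(values[name] for name in inputs))
    return values

# Hypothetical context computing (a + b) * (a - b), listed out of order.
context = {
    "mul": (operator.mul, ["add", "sub"]),
    "add": (operator.add, ["a", "b"]),
    "sub": (operator.sub, ["a", "b"]),
}
serial_ops = serialise(context)          # done once, at load time
outputs = [execute(serial_ops, {"a": a, "b": 2})["mul"] for a in (3, 5)]
print(outputs)
```

The sorting cost is paid once per context, while `execute` is a plain loop over a flat list, which is what makes repeated execution of the same context cheap in this sketch.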
1.3 Structure
Chapter 2 describes the overall background of this work, discussing the concepts behind
reconfigurable computing and the tools that are used.
The next chapters, Emulation (chapter 3), Scheduling (chapter 4), and Pipelining (chapter 5),
cover the main body of the work. Each of these follows the same structure: they begin with a
high-level description of the problem that they address, including a list of aims and objectives,
along with a summary of the novelties that are covered in that chapter. This summary is
followed by a review of the background and literature relevant to that work. The main material
follows, describing the concepts and algorithms. These are followed by relevant results and
analysis. The chapters close with some final words that explain the importance of what was
covered, and how that links to the chapters which follow.
Conclusions are presented in chapter 6, which restates the aims, objectives, and novelties, and




This chapter provides an overview of the technologies behind reconfigurable computing [13]:
both the hardware itself and the software used to develop applications for them. Section 2.3
describes the foundation of hardware and tools that existed before the work of this thesis, and
defines the scope of the problem that the work of this thesis is intended to solve.
Digital computation machines can be classified according to certain properties that they exhibit.
Some of these properties are largely related, and in opposition; an example being
programmability vs. throughput¹. If we plot various families of devices along an axis of
programmability/throughput, we see that they form a spectrum, as seen in figure 2.1. We shall also consider
cost from the perspective of an original equipment manufacturer (OEM) of consumer electronic










Figure 2.1: Spectrum of devices, from ASIC through to CPU. (Recoverable labels: ASIP, CPU/DSP; axis: Throughput.)




2.1.1 Application-Specific Integrated Circuits
At one end sit fully custom devices (ASICs), where the data paths required by a particular
application are provided hard-wired in the silicon. ASICs are the pinnacle of throughput² and
low power consumption³, but at the same time are the least generic/programmable. Similarly,
they have the highest non-recurring engineering (NRE) costs, and this continues to increase
as the silicon processes move to smaller and smaller feature sizes [14]. However, for high
volume products, they usually offer the lowest per-unit cost, since the die size will be small,
and intellectual property (IP) licensing royalties are likely to be lower since a higher proportion
of the design will be in-house.
For relatively low-volume products, the NREs of fully-custom ASIC design can be significantly
reduced by using structured ASICs [15] (mask programmable devices). These allow the same
parallelism as fully-custom devices, with a little extra area overhead, and reduced performance
in terms of power consumption and throughput, due to the standard part (silicon) having more
resources than needed and less optimal placements for a given design, and thus additional
wire lengths. However, structured ASICs are still fixed-function (i.e. not programmable).
2.1.2 Field Programmable Gate Arrays
Moving further into the spectrum from the left, we come to field-programmable gate arrays
(FPGAs) (such as those provided by Altera [16], Xilinx [17], and Actel [18]). These maintain
the parallelism available in ASICs, but allow a standard part to be reprogrammed to perform
very different functions. FPGAs are fine-grained reconfigurable fabrics, as their interconnect
and operations are at the bit-level. This granularity incurs a tremendous area overhead, due
to the drivers and switching needed to support the hierarchical interconnect network. This
also has an effect on performance, both power and throughput. The NREs are quite low with
this approach, as the parts are off-the-shelf. However, as a result of the area overhead
compared to ASICs, they come at a high per-unit cost. This is perfectly acceptable for low-volume
products, or where re-configurability is a key requirement. In high-volume products where
re-configurability is important, but only for a subset of the chip's functions, an increasingly
common work-around for the per-unit cost of FPGAs is to use embedded FPGAs as IP blocks
in an otherwise custom ASIC.
Another device family that begins to populate the centre of the spectrum is field-programmable
object arrays (FPOAs) [19, 20]. These are more coarse-grained data-path machines, which
have the potential for reduced area overhead compared to FPGAs, since the coarser granularity
reduces the complexity (and area) of the interconnect network. However, they are also less
flexible as a result (and are often domain-specific).
The high reconfiguration time for FPGAs and some of the more coarse-grained variants makes
them poor at performing control-flow intensive tasks. Two work-arounds are possible: either
statically map all the logic for each path of the control flow into a single configuration, or couple
² due to the parallelism of the hardware exactly matching that of the algorithms.
³ due to direct wire connections and optimum placement.
Background
the device with a microprocessor. The latter involves partitioning the design so that control-flow
intensive parts of the application are implemented on the microprocessor, and the data-path
intensive parts are mapped onto the reconfigurable fabric. The reconfigurable fabric therefore
acts as a co-processor in this arrangement. This approach is becoming increasingly common
in high-performance computing (HPC) [21, 22, 23]. However, the co-processor arrangement
requires two separate toolchains and design methodologies to be used [24], increasing the
development time. This concern has in part led to the introduction of microprocessor IP blocks
(hard macros) in high-end FPGAs, with tool chains that simplify the interactions. The use of
hard macros also serves to reduce the package count, and to increase the bandwidth between
the processor (hard macro) and co-processor (reconfigurable fabric).
2.1.3 Microprocessors
At the far right of the spectrum lie microprocessors (CPUs) and their variants (DSPs). These
represent the ultimate in flexibility, but offer the poorest throughput for arbitrarily complex
algorithms. As off-the-shelf components, they offer the lowest NREs, but for high volume
products the cost is dominated by the additional area. For use in system-on-chip, most CPUs
also incur IP licensing fees.
A lot of work has gone into improving the performance of microprocessors. One method has
been to optimise the layout of the device so as to maximise the operating frequency. This is
economically viable for highly general-purpose devices, as the substantial increase in NRE is
amortised over a much larger number of sales, keeping the unit cost low. However, there is a
fundamental limit to which this is possible, referred to as the power wall [25]. Another method
is to increase the amount of available parallelism: the approach used depends on the application
domain, and how common particular operations are.
For the most general-purpose CPUs, the approach has been to provide multiple deeply pipelined
execution units, each with their own queue (superscalar [26]), and use sophisticated hardware
to distribute and re-schedule the incoming instruction stream in order to keep the pipelines as
full as possible, using techniques such as branch prediction [27] and speculative/out-of-order
execution [28]. These hardware re-scheduling techniques are very costly in area and power
[29], and place such processors well out of the price range and power budget for embedded
systems.
For less general-purpose, domain-oriented DSPs, parallelism is increased by providing
multiple arithmetic logic units (ALUs), yielding a family of devices called Very Long Instruction
Word (VLIW) processors [30] and their derivatives [31]. Compile-time software optimisation
techniques are used to make the most of the available ALUs [32]. However, there is a limit
to the extent to which instruction-level parallelism can be exploited in general-purpose code
(the ILP wall [33]), making these devices specific to particular types of applications, such as
those that perform compute-intensive kernels (inner loops), common in digital signal
processing. Since such optimisations are static (performed at compile-time), they incur no additional





Some high-throughput streaming applications that involve performing the same operations on
large sets of data involve too much state for shared register files (common in VLIWs) to be
able to avoid the memory wall [34]. In order to meet the throughput requirements, each ALU
must be given a separate (small) local memory, making it almost a fully-fledged processor in
its own right. This is Single Instruction Multiple Data (SIMD): arrays of processors, where a
single program counter controls an entire group of processors, allowing them to perform the
same operation on several different sets of data at once, where each data set maintains its own
local state [2]. This type of architecture is particularly well suited to streaming, and is becoming
ubiquitous in modern graphics processing units (GPUs) [35, 36, 37].
For larger streaming applications, particularly where there is a lot of control flow, a more
complex memory architecture is required, allowing point-to-point communication between the
cores. This communication network is often able to feed programs as well as data between
the cores, significantly increasing flexibility. Examples include Ambric [1], PicoArray [38],
and Tilera [39]. Such processor array architectures tend to use a network-on-chip, or globally
asynchronous locally synchronous (GALS) interconnect to improve timing [40] and to simplify
the task of programming them (i.e. since these are self-synchronised). From a cost perspective,
the area overhead places them well out of the reach of today's embedded systems. However,
the level of re-configurability and relative ease of programming makes these very interesting in
other roles (e.g. video compression, cellular network equipment). These architectures can also
be thought of as a natural extension to ALU-based coarse-grained reconfigurables (described
in section 2.1.6), where control logic and local program memories are added to each ALU,
making them become full-fledged processors.
2.1.5 Application-Specific Instruction Set Processors
The idea of custom data-paths for direct realisation of a particular algorithm can also be applied
to microprocessors. This technology is referred to as application-specific instruction-set
processors (ASIPs). The custom data-paths are integrated directly into the instruction execution
pipeline, making them appear as special instructions, albeit instructions with a comparatively
high latency. These custom instructions can improve the throughput by several orders of
magnitude, and significantly reduce the power consumption for these operations.⁴ However, since the
custom data paths are hard-wired, they cannot be reprogrammed; reprogrammability has to be
achieved through use of normal instructions executed in series, which causes a sharp reduction
in throughput. Certain application domains have sufficient high-throughput tasks in common
that it is commercially viable to provide ASIPs as an off-the-shelf part targeted towards that
domain. However, they are more commonly provided as IP along with tools for creating custom
instructions for the customer's particular application (such as those provided by Tensilica [41],
ARC [42], and Stretch [43]).
A related device family occupies the centre of the spectrum: reconfigurable ASIPs. These
replace the hard-wired custom instructions of a regular ASIP with some reconfigurable fabric,
onto which custom instructions can be rendered [44]. This makes them similar to an FPGA
co-processor (to a master microprocessor), except that the reconfigurable fabric is tightly
integrated into the instruction execution pipeline of the master processor. This partly solves the
problem of getting data into and out of an FPGA co-processor's embedded memory, which
is normally the bottleneck in the co-processor approach. The down-side is that the maximum
data path size is more limited, which reduces the available parallelism. As a result, to make
most efficient use of the available resources, and to justify the additional area incurred by the
interconnect, the custom instructions should be modified/replaced at run-time. Some
sophisticated run-time scheduling techniques have been demonstrated to achieve this goal [45] by
dynamically switching between custom instructions rendered onto the reconfigurable fabric
and equivalent sequences of normal instructions, according to the current demand on the
system. Reconfigurable ASIPs are also relatively small in area, so are more suitable for embedded
systems.
⁴ compared to performing the same task with an equivalent sequence of regular machine instructions.
2.1.6 Coarse-Grained Reconfigurable Architectures
Coarse-grained reconfigurable arrays (CGRAs) also occupy the central region of the device
spectrum, but approach it from the side of data path machines [46]. They share similar
interconnect concepts to FPGAs, but the coarser granularity results in fewer functional units
being needed to implement a given task. The increase in bit width and reduction in functional
unit count both lead to a reduction in area overhead.
The functional units of CGRA fabrics tend to be either ALU based [47], or heterogeneous
[6]. Heterogeneous arrays can offer reduced area overhead since the silicon utilisation in the
functional units per unit time can be higher. This is because the minimum array size for a given
design is determined by the operations involved in the data paths of the largest configuration
context. An ALU-based array would need at least as many ALUs as there are operations,
even though all but one of the operations that an ALU can perform go unused (see figure 2.2).
On the other hand, a heterogeneous array could provide largely the same operations as are
actually needed. However, the heterogeneous approach complicates resource allocation (i.e.
the development tools), and can lead to longer average path lengths.
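The sizing argument can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that an ALU occupies the summed area of every operation it supports, and that a heterogeneous array needs, for each operation type, as many cells as the most demanding context uses. The relative areas, operation names, and function names are hypothetical and are not taken from RICA.

```python
from collections import Counter

# Hypothetical relative silicon areas of individual operation types.
OP_AREA = {"ADD": 1.0, "MUL": 4.0, "SHIFT": 1.5}

def alu_array_area(contexts):
    """One full ALU per operation in the largest configuration context."""
    alu_area = sum(OP_AREA.values())
    return max(len(ops) for ops in contexts) * alu_area

def heterogeneous_array_area(contexts):
    """One dedicated cell per operation, sized by per-type peak demand."""
    peak = Counter()
    for ops in contexts:
        for op_type, count in Counter(ops).items():
            peak[op_type] = max(peak[op_type], count)
    return sum(OP_AREA[op_type] * count for op_type, count in peak.items())

contexts = [["ADD", "ADD", "MUL"], ["SHIFT", "ADD"]]
alu_area = alu_array_area(contexts)
het_area = heterogeneous_array_area(contexts)
print(alu_area, het_area)
```

Under these assumptions the heterogeneous array is considerably smaller, because no cell carries circuitry for operations it never performs. Interconnect area and the longer path lengths mentioned above are not modelled.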
2.1.7 Dynamically Reconfigurable Arrays
This sets the scene for a newcomer to this centre position: dynamically reconfigurable arrays
(DRAs). These are heterogeneous coarse-grained data-path orientated reconfigurable
architectures, with built-in control flow capabilities [48].
By moving to coarse-grained reconfigurables, the size of the configuration context decreases,
and therefore so does the reconfiguration time (assuming the same bandwidth is available). If
the configuration can be made small enough, it becomes practical to save multiple contexts
in registers directly in the device. This provides a much higher bandwidth, and allows the
reconfiguration time to drop to nanoseconds. With such low reconfiguration times, even control
tasks can be performed directly in the reconfigurable fabric. A configuration context no longer
has to be a free-standing circuit: the loop body of an algorithm can even be split into a sequence
of configuration contexts, each loaded one after another for each iteration.⁵
⁵ N.B. only a single batch of data is operated on between each context switch.
Figure 2.2: ALU-based homogeneous (a) v.s. heterogeneous array (b). Distinct operation types 
are shown in different colours. The active operations are outlined in (a). To realise 
the same data-paths, the heterogeneous array is smaller. However, allocation is more 
constrained in the heterogeneous array, which could result in longer wire lengths (not 
shown).
2.1.8 RICA
The reconfigurable instruction cell array (RICA) [5, 49], which this thesis focusses on, is a
heterogeneous dynamically reconfigurable array. A diagram of a simplified RICA array is
shown in figure 2.3.
The functional units (cells) are chosen to match the data width and functionality of RISC
instructions in a typical C compiler. RICA uses the concept of distributed registers: a significant
fraction of the instruction cells are registers. The functional units are connected together via an
island-based interconnect network, with an array of switch boxes. The array uses a Harvard
memory architecture (to maximise bandwidth): the program memory and data memory are
separate. In many application domains, the array can also have special-purpose stream
memories (line buffers) which further increase the on-chip bandwidth. Data can be passed into
and out of the array either via the data memory (memory-mapped I/O), or via special-purpose
interface cells; the choice of which depends on the application domain and array size.
The array is in control of its own reconfiguration: a special cell in the core, the jump cell,
provides access to the program counter and allows the data paths to influence program control
flow. Since the structure of the core allows arbitrarily complex data paths to be constructed
between the available functional units,⁶ the combinatorial delay of the critical path in each
configuration context may be very different. To account for this, a reconfiguration rate controller
(RRC) is used to control the length of time (number of master clock cycles) for which the
configuration context persists. This value is stored as part of each configuration context. When the
⁶ subject to resource availability and routing considerations.
Figure 2.3: Simplified example of a reconfigurable instruction cell array (RICA). The
reconfigurable interconnect is shown in grey.
RRC expires, the state of the synchronous cells (such as registers) is updated, and the
configuration context referenced by the program counter is loaded. The program counter may refer
to the same step as had just ended, in which case that configuration context persists for another
iteration, without incurring any transactions from program memory, or any reconfiguration
delay. This is the most efficient way to execute loops on RICA; such single-context loops are
called kernels. If, in a given configuration context, the next value of the program counter can be
predicted in advance,⁷ the configuration context may be pre-fetched, thus reducing the context
switch time. This occurs in configuration contexts that perform an unconditional jump to a
constant address.
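The interaction between the RRC, the program counter, and program memory can be sketched with a toy execution loop. This is a behavioural illustration only, not a model of the real hardware or of the emulator in chapter 3: the context table format, the state dictionary, and all function names are hypothetical, and pre-fetching is not modelled.

```python
def run(contexts, start, max_steps=100):
    """Step through configuration contexts, counting program-memory fetches.

    contexts: dict mapping address -> (rrc_cycles, step_function), where
    step_function(state) updates the registers and returns the next program
    counter value, or None to halt.
    """
    state = {"i": 0, "acc": 0}
    pc, loaded, fetches, cycles = start, None, 0, 0
    for _ in range(max_steps):
        if pc != loaded:            # a different context must be fetched
            fetches += 1
            loaded = pc
        rrc_cycles, step = contexts[pc]
        cycles += rrc_cycles        # context persists until the RRC expires
        pc = step(state)            # jump cell supplies the next program counter
        if pc is None:
            break
    return state, fetches, cycles

def kernel(state):
    state["acc"] += state["i"]
    state["i"] += 1
    return 0 if state["i"] < 4 else 1   # loop back to the same context

def finish(state):
    return None

state, fetches, cycles = run({0: (3, kernel), 1: (1, finish)}, start=0)
print(state, fetches, cycles)
```

The kernel context runs four iterations but is fetched only once, because the program counter keeps pointing at the context that is already loaded; the second fetch is for the exit context.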
⁷ i.e. if it doesn't depend on values read from the data path in that configuration context.
2.2 Programming Methodologies for Reconfigurable Architectures
Looking at another property (that of design effort for an OEM using these devices), the
technologies on the left of the spectrum shown in figure 2.1 are designed/programmed using
hardware description languages (HDLs). These offer relatively little abstraction, and expose the
application designer to a lot of low-level book-keeping issues such as clock domains and
communication timing. As the name implies, these languages are intended for describing data paths
and logic, i.e. physically realisable, static circuits.
On the other hand, the various CPU-derived technologies that occupy the right of the spectrum
are programmed using high-level languages such as C. This normally equates to significantly
lower development times. However, it should be noted that hardware acceleration features such
as the custom instructions on ASIPs are designed using HDLs.
Data-path reconfigurable architectures, from FPGAs through to coarse-grained arrays, have
primarily come from the ASIC world,8 and as a result have mainly been programmed using tools
that work from HDL of some form or another. In general, the more fine-grained an architecture,
the larger the configuration size, and thus the longer it takes to reconfigure the device at
run-time. With such a limitation, applications have to be mapped either as a static
configuration (all-in-one), or as a set of configurations that each must persist for many
milliseconds (or seconds) at a time. These long-living configurations have to be free-standing
circuits in order to be able to do useful work, and so are still best described by HDLs.
The potential for silicon re-use9 provided by dynamic reconfiguration has spurred a lot of
research into simplifying the task of efficiently mapping a design into a smaller device, using
time-division multiplexing of the resources. This includes active research in extending HDLs
to support dynamic reconfiguration (JHDL [50]), and software approaches to schedule when
a configuration is loaded into a shared FPGA resource, to maximise the useful work that can
be done by each configuration. This attempts to work around the problem of the reconfiguration
time. However, the large time scale involved in the reconfiguration means that it is still
impractical to perform control tasks in the reconfigurable device itself; a master CPU is needed.
The key feature of DRA architectures like RICA is that they reconfigure rapidly enough for the
array to be given control of its own reconfiguration, which removes the need for a master CPU.
This combination of real-time arbitrary control flow and being able to time-division multiplex a
loop body makes this architecture suitable for programming from a high-level software programming
language like C. This essentially merges the two domains of data-path orientated design (HDL)
and sequential design (software programming languages). The main drive for this approach is
trying to achieve the throughput (and low area) of coarse-grained data-path architectures, with
the ease of use of microprocessors.
8 and marketed as soft ASICs.
9 and thus the reduction in area for a given design.
2.2.0.1 HDL Based Tool Flows
Tool flows based on hardware description languages consist of an HDL simulator and synthesis
(place&route) tools. Examples:
•  HDL to gates.
•  HDL to configuration bitstream (e.g. for FPGAs).
•  JHDL to set of configuration bitstreams (for dynamic reconfiguration of FPGAs).
HDL tool flows suffer from an inherently longer development time than high-level software
languages. This is a result of the lower degree of abstraction. Also, because the search space
is large, place&route tools take a long time to run, further lengthening development cycles.
Simulation is also slow, since the entire machine has to be modelled: the language models
the target machine at a low level, so this is the level at which simulation must be performed.
This will remain the case until tools exist with sufficient intelligence to recognise (and
therefore model) more abstract, high-level components, from the low-level description. The
primary advantage of HDL is the ability to describe operations that directly conform to a
particular time base; high-level software languages have no notion of absolute time.
2.2.0.2 High-Level Software Based Tool Flows
Typical software tool chains consist of a compiler, simulator, and some kind of back-end.
Examples:
•  C to assembly. The assembly normally matches the underlying instruction set of the
target machine. The back-end is a linker.
•  C to gates. The back-end normally generates HDL for subsequent place&route.
Use of high-level languages is normally only possible when targeting microprocessors or similar
state machines. The flexibility introduced to allow for complex control flow in these
architectures results in them having relatively poor throughput. As discussed in section 2.1.3,
recent trends in microprocessor design have led to ways of increasing throughput by the
introduction of pipelines. These, however, have the effect of introducing ‘momentum’: changes in
control flow lead to significant reductions in throughput. Branch prediction tackles this in
turn, but all of this comes at the cost of substantial silicon area.
C to gates tools (such as [51] or Catapult C [52]) take advantage of the fact that most DSP
algorithms have simple control flows, with relatively large areas of computation where there is
a significant degree of parallelism. Each node of the control flow graph is mapped in silicon.
This approach has the disadvantage of poor silicon re-use, even when partial reconfiguration is
used.10 Therefore, the main overhead of this approach is area. The presence of large data paths
and long connections between blocks also sacrifices throughput. Since HDL is the back-end,
final rounds of testing (from HDL) exhibit the same development cycle as with designs that
were HDL from the beginning.
10 due to the high configuration time of fine-grained architectures.
The recent addition of declarative approaches such as Hume [53] has provided another dimension
to this, by allowing applications to be designed from a number of different levels:
hardware-level (similar to HDL), template/skeleton-level (joining together pre-defined blocks),
and high-level (Turing complete). This allows the designer to make a different trade-off between
ease of description, provability, and efficiency for each part of the design, and to later go
back and revise this decision. These are particularly useful in resource-limited or
safety-critical systems.
2.2.1 Re-targetable Toolchains
Different CPU architectures exhibit many common properties, which tools can take advantage
of: re-targetable toolchains can be created whereby support for a new architecture is added
by describing the automaton11 and to a certain extent stating which properties it exhibits. The
tools can then automatically generate an optimising compiler, linker and simulator for the target
architecture. Examples of such re-targetable toolchains are LISA [54] and Expression [55].
However, in order to take advantage of operation chaining in reconfigurable computing
architectures such as DRAs (section 2.1.7), many of these common properties can no longer be
assumed. If the assembly instructions correspond to operations supported by cells in the core,
coarse-grained reconfigurable computing architectures can execute code described in normal
assembly in a purely sequential manner just like a microprocessor, with one operation per
configuration context. But this would result in very poor core utilisation. Extending this with
multiple issue assembly (as used with VLIWs) improves this, but still doesn’t take advantage of
operation chaining, again leading to poor core utilisation.
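To make the difference in utilisation concrete, the sketch below counts the configuration contexts needed for one small data flow graph under the three models just described. It is a toy comparison under stated assumptions: the function names and the example graph are hypothetical, and cell types, timing, routing and register limits are all ignored.

```python
import math


def sequential_steps(deps):
    """One operation per configuration context."""
    return len(deps)


def vliw_steps(deps, width):
    """Multiple issue without chaining: ops in one group must be independent."""
    done, steps = set(), 0
    remaining = list(deps)                 # program order
    while remaining:
        group = []
        for op in remaining:
            # only results committed in *earlier* steps may be consumed
            if len(group) < width and all(d in done for d in deps[op]):
                group.append(op)
        if not group:
            raise ValueError("cyclic dependencies")
        done.update(group)
        remaining = [op for op in remaining if op not in done]
        steps += 1
    return steps


def chained_steps(deps, cells):
    """Full chaining: dependent ops share a context, limited only by cell count
    (an optimistic assumption for this sketch)."""
    return math.ceil(len(deps) / cells)


# (a + b) * (c + d) - e, as op -> list of ops it depends on
dfg = {"add1": [], "add2": [], "mul": ["add1", "add2"], "sub": ["mul"]}

counts = (sequential_steps(dfg), vliw_steps(dfg, width=2), chained_steps(dfg, cells=8))
```

For this graph the purely sequential mapping needs four contexts, two-wide issue without chaining needs three, and a fully chained mapping fits in one.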
2.2.2 Supporting Operation Chaining
In regular assembly, each line represents a change of machine state. With normal
microprocessors, each line consists of a single instruction, so the state change corresponds
solely to that introduced by that instruction. For use with multiple issue VLIWs, the lines of
the assembly can contain certain groups of more than one instruction. The state change is
therefore that attributed to all of these acting together [56]. Since the state mostly consists
of the contents of the registers, and each line of assembly corresponds to a single change in
state of the registers, this means that it is not possible to chain together the operations in a
group, since that would require multiple accesses to the register used to join the operations.
Furthermore, the latency would be changed. However, some modern VLIWs (or ULIWs [31]) provide
limited combinations of operation chaining, which avoid the use of registers, and instead rely
on physical connections between the FUs. Again, this is fairly limited, although it does
increase the possibility of being able to create a single line of assembly that loops back to
itself, thus avoiding fetching from memory and decode time.
11 register classes, supported operations, and pipeline geometry.
To support full and arbitrary operation chaining, we must expand the range over which a state
change is expressed: instead of mirroring the state change at each line of the assembly, we
should mirror the state change on a larger scale. To maximise the available parallelism, we
want to choose as large a range as possible. In order to still meet the control flow demands,
the sensible solution is to choose the basic block, i.e. the resulting configuration (or sequence
thereof) of the target machine should result in the same sequence of state changes as those
corresponding to the end of each basic block in the assembly.12
The compiler has to be modified to ensure that new block labels are placed after each jump,
so execution cannot leave part-way through a basic block.13 To help maximise core utilisation,
optimisation steps have to be modified to attempt to maximise the size of each basic block [57].
For maximum throughput, each basic block should map to a single configuration context on the
target array. However, resource availability and other timing constraints may prevent this from
being possible. It is the task of the scheduler to identify these constraints, and to distribute
the operations of the basic block across multiple configuration contexts where needed. This will
be addressed in chapter 4.
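The requirement that a label follows every jump can be illustrated by a small splitter that cuts an assembly listing into basic blocks. This is a sketch only: the jump mnemonics in `JUMPS`, the `.Lbb` label naming, and the example listing are assumptions made for illustration, not the actual RICA assembly syntax or the compiler modification itself.

```python
JUMPS = {"JUMP", "JUMPC"}   # hypothetical jump mnemonics


def split_basic_blocks(lines):
    """Return a list of (label, instructions); a block ends at a jump or before a label."""
    blocks = []
    label, body, fresh = None, [], 0
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.endswith(":"):                    # an existing label starts a new block
            if label is not None:
                blocks.append((label, body))
            label, body = text[:-1], []
            continue
        if label is None:                         # code after a jump with no label:
            label, fresh = f".Lbb{fresh}", fresh + 1  # synthesise one
        body.append(text)
        if text.split()[0].upper() in JUMPS:      # a jump always closes the block
            blocks.append((label, body))
            label, body = None, []
    if label is not None:
        blocks.append((label, body))
    return blocks


listing = [
    "main:",
    "ADD r1, r2, r3",
    "JUMPC r1, loop",
    "SUB r4, r1, r2",      # follows a jump without a label of its own
    "loop:",
    "MUL r5, r4, r4",
    "JUMP loop",
]
blocks = split_basic_blocks(listing)
labels = [name for name, _ in blocks]
```

Every block produced this way contains at most one jump, and only as its last instruction, so control flow can only leave at the end of a block.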
2.2.3 Working From Assembly
With minimal changes to a conventional re-targetable compiler, it is possible to target
architectures that support high degrees of operation chaining. This is done by representing the
functional units by instructions with identical functionality, and using the concept of the basic
block14 as the basis for constructing one or more configuration contexts.
The assembly for a basic block describes all connections between operations in the data flow
graph, using registers as the transport medium. In the ideal case where a basic block maps to
a single configuration context, all such internal connections are in fact achieved through the
interconnect (i.e. wires), avoiding the register file completely. In order to achieve the correct
state of the register file after executing a basic block, it is only necessary to write the final
value to each register. All other values written to registers serve only to identify the
connectivity, and are turned into wires, or allocated to temporary registers over the boundaries
in the resulting configuration contexts. However, there is no direct way of expressing this to
the compiler. Furthermore, the compiler has to know how many registers the target supports, in
order to determine which edges in the data flow graph may be stored in machine registers, and
which on the stack (data memory).
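The distinction between register writes that merely express connectivity and those that must actually reach the register file can be sketched as follows, for the ideal case of a basic block that maps to a single configuration context. The tuple format for instructions and the name `analyse_block` are assumptions made for this illustration.

```python
def analyse_block(instrs):
    """instrs: list of (opcode, dest_register_or_None, [source_registers]).

    Returns (wires, inputs, final_writes):
      wires        - (producer_index, consumer_index) edges carried by the interconnect
      inputs       - registers read from the register file on entry to the block
      final_writes - register -> index of the instruction whose value must be committed
    """
    last_def, wires, inputs = {}, [], set()
    for i, (_, dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in last_def:
                wires.append((last_def[s], i))   # value produced inside the block
            else:
                inputs.add(s)                    # value comes from the register file
        if dest is not None:
            last_def[dest] = i                   # a later write hides an earlier one
    return wires, inputs, dict(last_def)


block = [
    ("ADD", "r1", ["r2", "r3"]),   # 0: r1 here is only a temporary
    ("MUL", "r1", ["r1", "r4"]),   # 1: overwrites r1, so this is the final value
    ("SUB", "r5", ["r1", "r2"]),   # 2
]
wires, inputs, final_writes = analyse_block(block)
```

Here the first write to `r1` never needs to reach the register file: it only identifies the wire from the adder to the multiplier.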
If we specify too few registers, the compiler will make needlessly heavy use of data memory
(the stack) to store values. This reduces parallelism,15 and has a significant effect on latency.
It is undesirable to have stack tracking built into the scheduler, as this imposes a need for too
much machine-specific knowledge, which reduces re-targetability. If we specify an infinite (or
very large) register count, it may be possible for the compiler to generate basic blocks with
too many independent data paths for their initial and final values to be brought into and out
of the basic block via real registers in the core. This would result in the basic block not being
physically realisable in the core. The responsibility of deciding which edges in the data flow
graph are stored on the stack could be given to the scheduler, but again, this would result in
reduced re-targetability.
12 i.e. the state after executing the last instruction of the basic block.
13 to ensure that the operations performed are the same irrespective of the direction of control
flow, since the order in the assembly may not be adhered to.
14 groups of instructions with no control flow in between.
15 since dependent memory operations have to occur in more or less the same order as described in
the assembly, imposing additional configuration contexts.
One inefficiency of using the assembly representation therefore is the needless change in value
of certain registers that are used only to store temporary values that should really just be
wires. However, by analysing the lifetime of each value stored in each register between basic
blocks throughout the entire program’s control flow graph, it is possible to determine which
registers are live, and which just store temporaries. This information can then be used to avoid
writing values to dead registers, and thus help reduce power, and free more registers for other
uses by the scheduler. In particular, this tends to reduce the chance of register starvation
during multi-step scheduling, since final results that are never used do not tie up a register
all the way through until the last configuration context generated from the basic block. A method
for doing this is described in section 4.7.
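As an illustration of the idea, the sketch below uses the textbook backward data-flow formulation of liveness over a control flow graph. It is not necessarily the method of section 4.7; the block names, the `(use, defs)` representation, and the assumption that no register is live at program exit are all choices made for this example.

```python
def liveness(blocks, succ):
    """blocks: name -> (use, defs), where `use` holds registers read before being
    written in the block and `defs` holds registers written in the block.
    succ: name -> list of successor block names."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                                  # iterate to a fixed point
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            inn = use | (out - defs)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out


blocks = {
    "entry": (set(), {"r1", "r2"}),
    "loop": ({"r1"}, {"r1", "r3"}),
    "exit": ({"r1"}, {"r4"}),
}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}

live_in, live_out = liveness(blocks, succ)
# Final writes to registers that are not live on exit from a block can be skipped.
dead_writes = {b: defs - live_out[b] for b, (_, defs) in blocks.items()}
```

In this example only `r1` is ever live across a block boundary, so the writes to `r2`, `r3` and `r4` need not tie up a register.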
The main work of this thesis, creating re-targetable toolchains for rapidly dynamically
reconfigurable architectures, chooses to work from assembly. The main reason for this decision is
development time: the extensive set of existing language front-ends and middle-end
optimisations of GCC can be leveraged, without having to extensively modify the internals of
the compiler. Simply writing a back-end is mostly sufficient.
An alternative approach would be to use the compiler’s intermediate representation: TreeSSA
[58, 59]. However, this was not in a stable state at the time of starting the work on the
scheduling. Using TreeSSA may make more information available for scheduling, e.g. memory alias
set information, which could reduce the number of configuration contexts produced from basic
blocks that involve a lot of data memory activity.
Working from the assembly has the advantage of allowing us to perform scheduling in a separate
tool, which:
•  Gives more flexibility in adding features (e.g. DFG visualisation, optimised linking).
•  Reduces build time (GCC is slow to compile).
•  Allows us to change to a different compiler, with little extra work. This allows a range
of compilers to be compared, and speeds the adoption of a new compiler if one is found
to have more appropriate optimisation passes (e.g. LLVM [60], which emerged after this
work began).
•  Allows the scheduler to be used as part of a commercial product, without having to
release the source code, as demanded by the GPL used by GCC.
However, it also has the following disadvantages:
•  Much of the high-level information constructed by the compiler is discarded before
reaching the back-end. This makes it difficult to implement optimisations that affect
control flow and memory layout.
•  The compiler doesn’t have direct knowledge of the target architecture, so is unable to
make informed decisions about how to form basic blocks that best match the resources
available. This can lead to lower parallel efficiency.
Additionally, pragma compiler directives couldn’t be used with GCC, since they apply at the
function level, and not at the instruction or basic block level, where the information is needed
for the types of optimisations looked at in this work. However, an alternative approach was
devised for use with GCC: certain information such as pipelining requests could be encoded
as volatile inline assembly hidden behind a human-readable macro. This inline assembly ends
up in the same basic block as the operations that it is intended to affect.
Compiler optimisations can be independently developed to improve the generation of basic
blocks that are more suitable for use in data path architectures. Future work could integrate the




The work in this thesis targets the reconfigurable instruction cell array (RICA), which is
introduced in section 2.1.8. The complete toolchain for working with RICA is shown in figure 2.4.
Figure 2.4: Complete RICA toolchain: hardware and software generation. The primary tools and
files worked on in this thesis are highlighted in red.
RICA represents a family of related architectures. Prior to the onset of the work described in
this thesis, the basic RICA architecture consisted of an array of 32-bit cells, each connected
to a switch box (sbox). The switch boxes are connected in a simple 2-D grid, with 32-bit
unidirectional interconnect. Each of the four directions has both an input and output channel,
connecting it to the neighbour.
A basic set of cell types was provided, which closely match typical RISC instructions (e.g. add,
multiply, shift, logic, etc.), along with some special-purpose cells such as data memory access
(rmem/wmem), immediate constants (const), and program flow control (jump). These cells
are often referred to in the text by their corresponding instruction name, expressed in block
capitals (e.g. ADD). In some variants of the core, certain primitives were combined (e.g. add
and comp → addcomp, instruction mnemonic ADDCOMP). The cells, switch box, and support











setup 0 // input setup time . combinatorial cells have 0 setup time.
















setup 15 // 0.15ns.














Figure 2.5: Machine description file (MDF) syntax before the work of this thesis. Cell types and 
their properties and functionality are hard-coded into the tools; just the port layout, 
timing, instance counts, and arrangement in the core are controlled by the MDF. There 
is a fixed 1:1 relationship between instruction names and cell type names.
The machine description file (MDF) was created to describe the individual cell counts and what
locations they occupy in the array. Timing information and port layouts were also described
in the file, for use by the different tools. Each project would typically have its own MDF
associated with it. A simplified example is shown in figure 2.5. An array generator tool was
provided to generate a complete Verilog model of a RICA array, according to a particular MDF.
To generate software to run on the array, a compiler based on GCC with a custom back-end
had been created, which allowed C code to be compiled into a RISC-like assembly. A series of
scheduling tools were then used to batch the resulting assembly instructions into configuration
contexts, to create an abstract netlist. This abstract netlist describes just the connectivity
between cells, without describing how these connections map to actual paths on the interconnect.
A routing tool [61] exists to render these paths, creating a routed netlist. The routed netlist
can then be passed to a bitstream generator, to generate the configuration bitstream that would
be placed in the RICA core’s program memory.
This approach of using an existing compiler with a custom back-end, then performing scheduling
in a separate tool, was a pragmatic choice designed to get something working as quickly as
possible. GCC provides standards-compliant language front-ends, and a wealth of optimisation
passes, all ready to use out-of-the-box. The OpenRISC GCC back-end was used as the basis
for the RICA back-end, since its instruction set best matched the functionality of RICA’s cells.
The modifications involved adding support for multiplexers (mapped to C’s conditional
operator, ?:) and ensuring that labels were placed after each jump instruction, so as to ensure
that control flow always starts and ends at the beginning and end of a basic block. Performing
the scheduling in a separate tool allowed the flexibility of generating any arbitrary file
format (the RICA netlist in this case), and gave freedom to explore ideas without being
restricted by the confines of a pre-existing framework. Originally, the scope of the scheduler
was very limited, so this approach was the quickest to implement. Later, attempts were made to
leverage GCC’s own scheduling and register allocation algorithms; however, this proved difficult
and led to poor results.
Originally, to test software targeting a RICA array, the entire array would have to be simulated
in a conventional HDL simulator tool. This would require the complete software toolchain to
be run, and the bitstream loaded into the model of the RICA array. Later on, to reduce the
design iteration time, a high-level simulation tool (simulator) was provided. This operates on
the netlist, and simulates the behaviour of the array, generating execution profiles and a memory
dump, for debug purposes. The simulator is a SystemC behavioural model of the target array,
which runs faster than a full RTL simulation of the entire array. Also, the simulator
could run on either the abstract netlist or the routed netlist: if accurate timing information is
not important, the abstract netlist can be used to test the behaviour of the program, without
having to go through the lengthy process of routing each configuration context. This reduced
the design iteration time for typical small programs from hours to minutes.
The design space was defined by just the instance counts of each cell type, and their location
in the array. Disabling the use of certain cell types, or modifications to the instruction set,
required rewriting the compiler back-end and the scheduler. Furthermore, it was only possible
to describe combinatorial cells, or synchronous cells with no state. Registers were dealt with as
a special case.
2.3.1 The Work of This Thesis
The work described in this thesis expands this design space, by generalising the description of
the array. This involved extending the machine description file (MDF) syntax, and passing this
information to the various tools. The compiler was modified to read the MDF, to set the register
count and disable/enable the use of particular expansion patterns according to the resources




RRC step field width = 10, // # of bits representing the step persistence time.

















































cell Sbuf (volatile, disjoint)
instruction SRBUF (must be in first pipeline stage, side effects);
instruction SWBUF (must be in last pipeline stage, side effects);
instruction SRBUF_RAM (latency=3);





merge "'SBUF_SET_READ" + "'SBUF_SET_WRITE"
=> "'SBUF_SET_READ_WRITE";




Figure 2.6: Machine description file (MDF) syntax after the work of this thesis. Cell types can be 
freely described, along with an arbitrary mapping of (multiple) instructions to each 
cell type, allowing support for cell type aliasing and disjoint operations (i.e. cells that 
output data independently to their inputs). MDFs can now be cascaded. Only the base 
(‘features’) MDF is shown here.
One of the first modifications done to the MDF format was support for cascading, where one
MDF can inherit from and specialise another. This was used to partition the information into
three layers:
Features: The ports of each cell type, the instructions associated with them, and any special
properties that affect their scheduling.
Process: Timing information for the interconnect and cell types, based on a particular
manufacturing process node.
Target: Cell instance counts and their layout in the core, for a particular RICA core IP.
This allows common information to be shared between multiple designs, improving
maintainability and readability. Figure 2.6 shows a fragment from the ‘features’ layer of the new
format.
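The layering can be pictured as a recursive merge in which each more specific layer adds to or overrides the one beneath it. The sketch below models the three layers as nested Python dictionaries; this representation, the `cascade` function, and every field name and value in it are hypothetical and do not reflect the real MDF syntax or the real resolution rules.

```python
def cascade(base, override):
    """Return a new description in which `override` specialises `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = cascade(merged[key], value)   # merge nested sections
        else:
            merged[key] = value                         # more specific layer wins
    return merged


# Hypothetical contents for the three layers.
features = {"cells": {"Add": {"ports": ["in1", "in2", "out"], "instructions": ["ADD"]}}}
process = {"cells": {"Add": {"delay_ps": 150}}}
target = {"cells": {"Add": {"count": 4}}}

mdf = cascade(cascade(features, process), target)
```

Because `cascade` never modifies its arguments, the same ‘features’ and ‘process’ layers can be shared by several target descriptions.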
The concept of the scheduler was generalised, so that it could apply to a wider range of devices.
This involved devising a data model that was instruction agnostic16, and adding support for
more complex concepts such as the partial aliasing of functionality between different cell types
(e.g. MUL in figure 2.6, which is supported by Mul and Mul64), multiple instructions
representing the input and output side of a particular cell,17 (e.g. SRBUF for reading from a
stream buffer and SWBUF for writing to a stream buffer, shown in figure 2.6), or internally
pipelined cells (e.g. SRBUF_RAM in figure 2.6 which is internally pipelined into 4 stages).
Properties were defined for each instruction to define any special constraints that affect their
scheduling or how they can fit into pipelines.
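Partial aliasing means that an instruction may have more than one candidate cell type, so mapping instructions onto free cells becomes a small allocation problem. The sketch below shows one greedy way of handling it. The table contents other than MUL being supported by Mul and Mul64, the `MUL64` instruction name, and the "least flexible first" heuristic are assumptions for illustration; this is not the scheduler's actual algorithm, and a greedy choice is not guaranteed to find a placement in every case where one exists.

```python
# Instruction -> cell types able to execute it, in order of preference.
SUPPORTED_BY = {
    "ADD": ["Add"],
    "MUL": ["Mul", "Mul64"],     # partial aliasing, as described in the text
    "MUL64": ["Mul64"],          # hypothetical instruction only a Mul64 cell supports
}


def allocate(instructions, free_cells):
    """Greedily assign each instruction to a free cell; return None on failure."""
    free = dict(free_cells)
    placement = []
    # Place the least flexible instructions first so aliases do not starve them.
    for instr in sorted(instructions, key=lambda i: len(SUPPORTED_BY[i])):
        for cell in SUPPORTED_BY[instr]:
            if free.get(cell, 0) > 0:
                free[cell] -= 1
                placement.append((instr, cell))
                break
        else:
            return None          # no free cell of any supporting type
    return placement


fits = allocate(["MUL", "MUL", "MUL64"], {"Mul": 1, "Mul64": 2})
too_small = allocate(["MUL", "MUL", "MUL64"], {"Mul": 1, "Mul64": 1})
```

When a context cannot be filled this way, the remaining operations would have to be deferred to a later configuration context.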
Furthermore, to accommodate larger arrays, later designs increased the number of interconnect
channels. The MDF syntax was later extended to allow the entire interconnect topology to be
described, and this information used by the tools to explore alternative interconnect topologies,
for improved routability, performance, and area.
The data model of the scheduler was further utilised to perform more complex modifications
such as assembly-level optimisations, and pipelining. The pipelining work has been published
on two occasions [10, 11], which are both attached in appendix D.
In addition to this, this thesis describes algorithms that were used to construct a high-speed
emulator that could be used as a drop-in replacement for the simulator, so that more complex
programs could be tested. The emulator executes target code several orders of magnitude faster,
so further reduces the design iteration time from minutes to seconds for simple programs, and
makes it feasible to test much more complex programs or larger arrays, and to realistically
implement feedback-directed optimisation. This work was published [12], and is attached in
appendix D.
The overall effect of this work is that the tools can now address the entire scope of what RICA
can be. Additionally, the tools are now fully automated. This minimises the design iteration
time, turning these tools into a competitive product for hardware/software co-design.
16 i.e. could support any type of instruction.




As the scheduler’s role became increasingly sophisticated, the limitations of working from
assembly became apparent: certain optimisations need high-level information from the compiler
which is lost before reaching the back-end. Examples of this are memory alias sets, loop
constructs, and the mapping of variables to registers and the stack. Future work will move a lot
of the optimisations presented in this thesis into higher levels of the compiler, where more
information is available. This gives further advantages in terms of the ability to re-direct the
structure of the code generated according to knowledge of how well the basic blocks will map
to the target architecture, and the ability to automatically map variables to different types of




This chapter presented the background knowledge and related literature to set the scene for
the rest of the thesis. Computing architectures were looked at and presented as a spectrum,
which reflects the performance vs. flexibility trade-off inherent in each architecture. A general
observation here is that reconfigurable computing, i.e. where a data path machine is used
to perform generic computing tasks, usually consists of a data path machine coupled with
a microprocessor. This involves partitioning the program, and often requires each part to be
written in a different language.
The related software programming methodologies were also presented, showing how languages
come from two camps: the ASIC design world (HDLs) and general computing with microprocessors
(high-level languages). Recent developments in HDLs have allowed them to express
multiple configuration contexts, which helps them express dynamic reconfiguration, which is
common in reconfigurable computing. Coming from the other side, there is a lot of recent
work in going directly from high-level languages to data path machines (i.e. C-to-gates). This
supports rapid application development, whilst sacrificing area efficiency.
Previous work leading up to the work of this thesis was described: the RICA architecture, a
data path machine that can control its own reconfiguration, and can be reconfigured rapidly
enough to perform arbitrary control flow. This avoids the need for an accompanying
microprocessor, and allows an application to be deployed as a single code base in a single
(high-level) language.
The next chapters describe the work of this thesis: algorithms and methodologies used to
implement a tool chain that allows applications to be efficiently deployed from high-level languages




One of the advantages of the reconfigurable cell based processors that are the target architecture
for the work described in this thesis is their ability to be tailored to particular application
domains. The potential search space is very large, encompassing physically realisable designs
where the metrics of potential throughput and area for a given target application domain can
vary by several orders of magnitude.
Simulation is needed to allow for rapid modification and evaluation of the core design, avoiding
the time needed to re-implement and test the core using a hardware description language
(HDL) for an FPGA implementation, or the cost of re-fabricating the array. Furthermore, these
architectures are intended to be provided as flexible IP blocks, where the end-user can make
significant changes to the make-up and functionality of the core. The end-user expects a complete
toolchain to be available that is able to reflect these changes, in order for the complete
hardware/software design space to be explored. Such a toolchain normally consists of an
optimising compiler, and a simulator [54, 55]. The application domains that these architectures
are mainly aimed at tend to operate on large data sets, such as video playback (H.264 decoding
[3]), digital signal base-band processing [4], and image signal processing [10]. As a result,
simulation time is a crucial factor in determining the length of the architecture definition
cycle, and thus time to market. Finally, a high-speed simulator is necessary to provide
feedback-directed optimisation as a standard part of the toolchain.
3.0.0.1 Aims
• Provide a software simulator to allow the design search space to be explored within a
reasonable time frame.
• Allow rapid application development and validation.
3.0.0.2 Objectives
Generic: It must be easy to describe the target architecture (e.g. resource counts, timing
figures), and the simulation must adapt accordingly.
Extensible: It must be easy to add new functionality (e.g. cell types), preferably using a
high-level description.
Fast: The simulation should be as close to real-time as possible.
Accurate: The simulation should behave as much as possible like the target architecture at a
given level of abstraction, and should give a reasonable estimate of the timing (i.e. within
an order of magnitude).
Emulation
This chapter presents a software emulator that was developed for this class of architecture,
satisfying the above goals. Section 3.1 presents an overview of existing emulation/simulation
strategies used for reconfigurable architectures and data path machines, and for microprocessors.
Different simulation techniques serve different purposes: a finer granularity of state coherence
is needed for early stages of development of a target architecture, to perform more detailed
analysis of its behaviour, and to determine if and when it deviates from the intended specification.
However, once this has been sufficiently tested, this level of detail is no longer required, and
instead the focus (and role of simulation) shifts from that of architecture/tool development to
target application development. In this latter role, execution speed is of paramount importance.
Instruction-accurate emulators are the fastest form of model available for microprocessors. This
is mostly because the state only needs to be modelled as each instruction is executed (i.e. not
necessarily cycle accurate). This is only possible because the processors have been designed
to keep the state consistent at this granularity, and with sufficient knowledge of processor
internals, most other information can be derived from this.
3.0.0.3 Novelty
The novelty of the work presented in this chapter is in providing high speed emulation of
a self-controlling reconfigurable data path engine, by making the reconfigurable data paths
look like a regular instruction stream: load-time serialisation (section 3.1.3). This is done by
first classifying the operations, and then applying a topological sort to perform serialisation:
the serialisation algorithm (section 3.3.2). This serialisation is performed at load-time, i.e. only





The task of software simulation of a computer architecture involves the modelling of specific
machines on a general-purpose (sequential) computer. Different techniques differ in the granularity
of when the state of the simulation matches the state of the target hardware (state coherence).
This granularity is chosen according to the level of abstraction that is appropriate to what
the model is going to be used for. For instance, if the purpose of the model is to debug the
operation of the hardware, then a full register transfer level (RTL) simulation is usually necessary.
When debugging development tools for the target architecture (which likely require accurate
timing in order to validate their output), a full RTL simulation is probably overkill, and a more
abstract model that encompasses the timing of the system would be appropriate, such as a
SystemC model supplied with timing information extracted from RTL simulation. Finally, when
debugging applications that are to run on the target architecture, a further level of abstraction is
acceptable, where only the functionality and rough timing need to be captured. This last class
of model is called an emulator.
3.1.1 Background: Emulation
Figure 3.1: Modelling a serial machine on another serial machine: each instruction in the target
architecture's instruction set (left) is modelled by an equivalent instruction (or
sequence of instructions) in the host architecture's instruction set (right). Blue boxes
represent instructions, and the red lines show where the state of the two machines is to
match. In emulation, the state must match at the end of each instruction of the target
architecture.
Software-based emulation of microprocessors has been used since at least the 1970s [62].
Emulation models the instruction set of the target architecture by mimicking the way that the state
of the CPU, registers, and memory is affected by each operation in the instruction set. The fetch
and execution of instructions in the emulator is performed in the same sequential manner as in
the target CPU. The concept is shown in figure 3.1.
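The fetch-and-execute loop described above can be sketched as follows. This is a minimal
illustration only: the toy target (four registers and the opcodes LoadImm, Add, and Halt) and the
names Cpu_state, step, and run are assumptions made for the example, and do not correspond to
any of the emulators cited here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy target: four registers and three opcodes.
enum class Opcode { LoadImm, Add, Halt };

struct Instruction {
    Opcode op;
    int dst;           // Destination register index.
    int src;           // Source register index (used by Add only).
    std::int32_t imm;  // Immediate value (used by LoadImm only).
};

struct Cpu_state {
    std::int32_t reg[4] = {0, 0, 0, 0};
    std::size_t pc = 0;
    bool halted = false;
};

// Emulate exactly one target instruction. On return, the modelled state
// matches the state the target would have after that instruction.
void step(Cpu_state& cpu, const std::vector<Instruction>& program) {
    assert(cpu.pc < program.size());
    const Instruction& insn = program[cpu.pc++];
    switch (insn.op) {
    case Opcode::LoadImm: cpu.reg[insn.dst] = insn.imm; break;
    case Opcode::Add:     cpu.reg[insn.dst] += cpu.reg[insn.src]; break;
    case Opcode::Halt:    cpu.halted = true; break;
    }
}

// Fetch and execute sequentially, as the target CPU would, until it halts.
Cpu_state run(const std::vector<Instruction>& program) {
    Cpu_state cpu;
    while (!cpu.halted) step(cpu, program);
    return cpu;
}
```

Because the state is only required to match at instruction boundaries, each case of the switch is
free to use whatever host instructions are cheapest.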
Traditionally, such emulators have been custom-built to a particular target architecture and
platform [63]. Since most CPUs are conceptually similar, these concepts can be abstracted,
making the emulator extensible. This is commonly achieved through object-orientated design
[64, 65]. Emulators are part of many modern commercial tool sets [66]. Emulation sees the
following uses:
Behavioural validation: the target architecture and associated application development toolchain
can be proven before committing to silicon, or dedicating time to detailed HDL simulation.
Product/Application demonstration: the ability to add emulated hardware allows for applications
to be demonstrated in near real-time, before the hardware is available.
Provides an easily modifiable test bench: adding emulated hardware at the behavioural level
aids in developing peripherals, since these can be added to the emulator, and their usefulness
or interface design explored. This makes it easy to try out new ideas (platform
exploration), without having to design them beyond the behavioural level.
Reduces development time: algorithms can be tested and timing information estimated in a
fraction of the time of other software-based simulation techniques available.
Feedback-directed optimisation: information can be extracted about a program through profiling
during execution on the emulator. This information can then be used by a compiler
[56] to make more informed decisions when applying optimisation [67].
The generalisation of traditional emulation concepts has also extended to the point where
emulators can be automatically generated from an abstract machine description, along with an
optimising compiler/scheduler as part of a retargetable toolchain [54]. Machine description
languages have progressed to the extent that features of increasingly complex architectures can
be captured, including deep pipelining of functional units, multiple instruction issue, and the
design of the memory subsystem [55]. However, these languages are not yet able to capture the
operation chaining available in reconfigurable processors, except by enumerating every possible
configuration, which would be impractical. Such languages could nevertheless be extended
to capture this information, and such a description could be used to automatically generate a
simulator using the technology presented in this thesis.
Developments in modern compiler technology have exhausted much of the potential for static
optimisation, and so the trend is a shift towards feedback-directed optimisation. As a result, an
emulator for this purpose is likely to become a significant part of standard toolchains. With this
in mind, the speed of simulation directly affects the scalability of the toolchain with respect to
target applications, which are of ever increasing complexity. Hardware acceleration has been
commonly explored for use with emulation [68, 69, 70]. However, several of the uses listed
above make the requirement of additional hardware undesirable (if not impractical), and so a
software-only solution is the main focus of this thesis.
3.1.2 Background: Modelling Data Path Parallelism
The ability of reconfigurable instruction cell based architectures to execute arbitrary control
flow makes them similar to microprocessors, if we consider the state changes to be on the
instruction level. A configuration context is analogous to an instruction, but one that isn't part of
a fixed instruction set, per se. This similarity means that software-based simulation technologies
traditionally used with microprocessors can be adapted for reconfigurable instruction cell
based architectures, by extending them to take into account the parallelism in the array. However,
reconfigurable architectures support operation chaining (the ability to execute dependent
and independent instructions within the same clock cycle/configuration context), which
traditional emulation technology cannot model.
Figure 3.2: Modelling combinatorial data paths on a serial machine: the data paths (left) are broken
up into a sequence of instructions (blue boxes) in the host architecture's instruction
set (right). The red lines show where the state of the two machines is to match, which is
at the end of each complete iteration of the data paths, when their outputs settle. Many
of the instructions shown are part of the HDL simulation kernel, which generates the
sequence of instructions corresponding to each operation at run-time, according to the
events generated.
Modelling parallelism on a serial machine has already been addressed in HDL simulation,
particularly in simulators intended for dynamic reconfiguration [50]. The overall concept is shown in
figure 3.2. These concepts can be borrowed to derive an event-driven model that captures the
data paths between processing elements in the array. SystemC provides an object-orientated
event-driven model, called transaction level modelling (TLM), with a kernel similar to an
HDL simulator, but described only at the behavioural level in C++.
Such data path machine simulators are slow because they seek to model the target architecture
on a very fine granularity (i.e. per operation). The generation and processing of events
introduces an overhead each time an operation is to be executed. In the example of RICA, the
operations are relatively simple, and often map to only a few host instructions, which is often much
less than the overhead incurred to generate or process each event. Furthermore, events may be
triggered more than once for each operation (due to flutter) before the core stabilises. This gets
increasingly worse as the complexity of the data paths increases.
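The redundant evaluations caused by flutter can be illustrated with a small event-driven kernel.
This is a sketch under assumed names (Node, settle, example_nodes), not the kernel of any
particular HDL simulator: whenever a node's value changes, an event is queued for each of its
successors, so a node fed by two changing inputs may be evaluated more than once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// One operation in the data path. 'op' computes the node's output from the
// current values of all nodes; 'successors' lists the nodes that read it.
struct Node {
    std::function<int(const std::vector<int>&)> op;
    std::vector<std::size_t> successors;
};

// Event-driven settling: each time a node's value changes, an event is
// queued for each of its successors. Returns the number of evaluations.
std::size_t settle(const std::vector<Node>& nodes, std::vector<int>& values,
                   std::size_t changed_node) {
    assert(changed_node < nodes.size());
    std::queue<std::size_t> events;
    for (std::size_t s : nodes[changed_node].successors) events.push(s);
    std::size_t evaluations = 0;
    while (!events.empty()) {
        const std::size_t n = events.front();
        events.pop();
        ++evaluations;
        const int v = nodes[n].op(values);
        if (v != values[n]) {  // Output changed: notify the successors.
            values[n] = v;
            for (std::size_t s : nodes[n].successors) events.push(s);
        }
    }
    return evaluations;
}

// Example graph: node 0 is an input; a = in + 1; b = a * 2; c = a + b.
std::vector<Node> example_nodes() {
    return {
        {[](const std::vector<int>& v) { return v[0]; }, {1}},
        {[](const std::vector<int>& v) { return v[0] + 1; }, {2, 3}},
        {[](const std::vector<int>& v) { return v[1] * 2; }, {3}},
        {[](const std::vector<int>& v) { return v[1] + v[2]; }, {}},
    };
}
```

In this example graph, node c receives one event from a and another from b, so settling after a
change on the input takes four evaluations, where a statically ordered schedule (a, b, c) would
need only three. The queue handling itself is further overhead on top of each evaluation.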
This kernel-based approach of serialising in response to run-time events also imposes an overhead
per configuration context. For traditional reconfigurable and dynamically reconfigurable
hardware, the rate of reconfiguration is low, so the overhead of updating the event-driven model
on each configuration context represents only a small fraction of the total execution time. However,
reconfigurable instruction cell based processors are reconfigured many millions of times
per second, so this overhead introduced by the model is large compared to the actual work done
by the operations of the modelled cells.
Figure 3.3: Modelling sequences of parallel data paths on a serial machine: each set of parallel 
data paths (left) is converted at load-time into a series of instructions (blue boxes) in 
the host architecture’s instruction set (right). The red lines show where the state of the 
two machines must match, and these correspond to basic block boundaries. Within a 
basic block on the target architecture, the same series of instructions on the host must 
work with any given data, each time the basic block is called.
Therefore, moving this overhead into a pass prior to program execution is highly desirable, and
is analogous to partial evaluation [71]. A program ($prog$) can be thought of as a transform that
converts static input data ($I_{static}$) and dynamic input data ($I_{dynamic}$) into output data ($O$), i.e.:
$$prog : I_{static} \times I_{dynamic} \rightarrow O$$
Partial evaluation computes a residual program ($prog^{*}$) from the original program and the static
input data, that converts just the dynamic input data into the output data, i.e.:
$$prog^{*} : I_{dynamic} \rightarrow O$$
This is what the software-based emulator presented in this thesis does: it computes the residual
program, and executes that at run-time. The emulator moves away from the event-driven
approach, and instead mimics the same order of data flow by generating a static schedule of
operations that are performed sequentially, as shown in figure 3.3. The algorithm for generating
this serialisation, along with the required storage queues, is described in section 3.3.2. This is
a new extension to traditional software-based emulator technology, allowing this type of model
to work with these unusual architectures.
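The idea of a residual program can be sketched in a few lines. The example below is an
illustration of partial evaluation in general, not of the emulator's own data structures: the
types Op and Stage and the function specialise are hypothetical. The static input is a
description of a chain of operations; the residual program has already decoded that description
and takes only the dynamic input.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Static input: a description of a chain of operations, known at load time.
enum class Op { Add, Mul };
struct Stage { Op op; int constant; };

// prog : I_static x I_dynamic -> O. Decodes the description on every call.
int prog(const std::vector<Stage>& description, int x) {
    for (const Stage& s : description) {
        switch (s.op) {
        case Op::Add: x += s.constant; break;
        case Op::Mul: x *= s.constant; break;
        }
    }
    return x;
}

// Computes the residual program prog* : I_dynamic -> O. The decoding of
// each stage happens once, here, rather than on every execution.
std::function<int(int)> specialise(const std::vector<Stage>& description) {
    std::vector<std::function<int(int)>> steps;
    for (const Stage& s : description) {
        const int c = s.constant;
        switch (s.op) {
        case Op::Add: steps.push_back([c](int x) { return x + c; }); break;
        case Op::Mul: steps.push_back([c](int x) { return x * c; }); break;
        }
    }
    assert(steps.size() == description.size());  // One step per stage.
    return [steps](int x) {
        for (const auto& step : steps) x = step(x);
        return x;
    };
}
```

For any description d and input x, specialise(d)(x) yields the same result as prog(d, x); the
difference lies only in when the work of interpreting d is done.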
3.1.3 Contribution: Load-Time Serialisation
The key idea proposed in this chapter is to improve simulation speed by pushing as much
work into the pre-execution phase as possible. If we assume that, as with microprocessor
emulation, the architecture behaves according to specification, then the required granularity of
state coherence is that equivalent to an instruction. An instruction in a data path machine such
as RICA can be seen as the state change resulting from the execution of a single iteration
of a single configuration context. The instruction therefore consists of a number of dependent
and independent operations executing in parallel, or combinatorially (since operation chaining
is allowed). Note however that the term instruction is more commonly used to describe the
individual operations in the data path (i.e. the functionality of the cells), since these are what
are captured by the instructions in the assembly.
If the target program (netlist) is properly formed, the steps will all have been programmed to
be given enough time for the final results to stabilise before switching to the next iteration/step,
thus making the results deterministic. This is the same as saying that the architecture performs
to specification. Therefore, if just the intended functionality of the program is to be simulated,
we can assume that the program is properly formed, in which case only the final results of the
data paths are needed, i.e. the emulation need only ensure that the state matches that of the real
hardware at the end of each step, and not necessarily anywhere in between. These deterministic
final results can, by definition, be evaluated by the same sequence of operations each time (given
the same initial machine state). This is complicated by the presence of operation chaining, as
the operations in each chain must be executed in the correct relative order.
Note that simply recording the sequence of events generated by a simulator is one possible
solution to this, but not an ideal one since many operations are executed more than once, which
is redundant from the point of view of final state (at the end of the step). Furthermore, there
may be a certain extent of data set sensitivity, where some transitions may be missed during
recording because the value does not change for the particular data set used, even though in
general the value could change.
A more efficient approach would be to determine what order to execute each operation in, in
order to achieve the same results. Furthermore, it would be good to minimise the number of
operations executed, avoiding redundant executions. The serialisation algorithm proposed in
section 3.3.2 consists of a topological sort of 'actions' corresponding to each operation. There
are additional 'actions' used to capture synchronous effects such as the updating of registers at
the end of a step, and actions to generate initial values.
3.2 The Modelled System
Figure 3.4: Modelled system: reconfigurable core (simplified), memory, and example peripherals.
An example system that can be modelled with the emulator is shown in figure 3.4, and consists
of the instruction cell array core, with separate program and data memories, and some simple
peripherals. In this example, data memory is arranged in multiple banks, accessed through
special cells in the array. Since more than one memory access cell is provided in the array,
multiple accesses can be performed by the core in one configuration context. If all such accesses
are to different banks, then these accesses are performed in parallel. Otherwise, conflicting
requests are performed sequentially, which incurs a dynamic delay.
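The effect of bank conflicts on access time can be estimated with a short sketch. The function
below is illustrative only: the name access_slots is hypothetical, and the mapping of an address
to a bank by a simple modulo (word interleaving) is an assumption, since the actual bank
selection scheme is not specified here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the number of sequential access slots needed to service all the
// memory requests issued within one configuration context. Requests to
// different banks share a slot; requests to the same bank are serialised.
// Assumption: bank index = address modulo the number of banks.
std::size_t access_slots(const std::vector<std::uint32_t>& addresses,
                         std::size_t num_banks) {
    assert(num_banks > 0);
    std::vector<std::size_t> requests_per_bank(num_banks, 0);
    for (std::uint32_t address : addresses) {
        ++requests_per_bank[address % num_banks];
    }
    // The busiest bank determines how long the whole context must wait.
    return *std::max_element(requests_per_bank.begin(), requests_per_bank.end());
}
```

With four banks, the addresses 0, 1, 2, and 3 fall in different banks and need a single slot,
whereas the addresses 0 and 4 both fall in bank 0 and need two. Counts of this kind, gathered
during emulation, are the sort of information that could guide data placement.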
The emulator can be used to characterise memory access patterns and use the results to direct
scheduling and linking of the program (chapter 4) to optimise access to data memory
(feedback-derived optimisation, mentioned in section 3.1).
Software emulations of memory-mapped peripherals such as a DMA controller, video frame
buffer, or audio buffer can easily be added. These communicate with the core just as they
would in real life: either through the memory interface, or through special-purpose cells in the
array. New instruction cells can be added to the core simply by defining a new object. New
memory-mapped peripheral modules can be added by defining a new object for the peripheral,
which responds to events from the memory interface through known method calls (see
figure 3.5), in response to activity on the appropriate addresses.
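A memory-mapped peripheral of this kind might be modelled along the following lines. This is a
sketch only: the class names Peripheral, Memory_interface, and Write_counter, the read/write
method signatures, and the word-addressed memory are all assumptions made for the example, and
are not taken from the emulator's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Interface that every memory-mapped peripheral model implements.
class Peripheral {
public:
    virtual ~Peripheral() = default;
    virtual std::uint32_t read(std::uint32_t offset) = 0;
    virtual void write(std::uint32_t offset, std::uint32_t value) = 0;
};

// Word-addressed data memory that forwards accesses falling inside a mapped
// region to the peripheral that owns that region.
class Memory_interface {
public:
    explicit Memory_interface(std::size_t words) : ram_(words, 0) {}

    void map(std::uint32_t base, std::uint32_t size, Peripheral& device) {
        assert(size > 0);
        regions_[base] = Region{size, &device};
    }
    std::uint32_t read(std::uint32_t address) {
        std::uint32_t offset = 0;
        if (Peripheral* p = find(address, offset)) return p->read(offset);
        return ram_.at(address);
    }
    void write(std::uint32_t address, std::uint32_t value) {
        std::uint32_t offset = 0;
        if (Peripheral* p = find(address, offset)) p->write(offset, value);
        else ram_.at(address) = value;
    }

private:
    struct Region { std::uint32_t size; Peripheral* device; };

    // Finds the region containing 'address', if any, and its local offset.
    Peripheral* find(std::uint32_t address, std::uint32_t& offset) const {
        auto it = regions_.upper_bound(address);  // First region above address.
        if (it == regions_.begin()) return nullptr;
        --it;
        if (address - it->first >= it->second.size) return nullptr;
        offset = address - it->first;
        return it->second.device;
    }

    std::vector<std::uint32_t> ram_;
    std::map<std::uint32_t, Region> regions_;
};

// Example peripheral: a single register that counts the writes made to it.
class Write_counter : public Peripheral {
public:
    std::uint32_t read(std::uint32_t) override { return count_; }
    void write(std::uint32_t, std::uint32_t) override { ++count_; }
private:
    std::uint32_t count_ = 0;
};
```

The core model sees only Memory_interface, so a new peripheral is added by deriving from
Peripheral and mapping an instance onto an address range; no change to the core is needed.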
Peripherals can have a kernel that operates on a separate thread, if they are to perform operations
that are independent of the core. The video frame buffer emulation is an example of this: it
performs colour space conversions and renders frames largely on its own time-base. More
complex peripherals, such as a DMA controller, can be added that connect to both the memory
interface and to the array via special control cells. These could be implemented by creating a
new object for the special cell, and allowing the cell object to communicate with the memory













This section describes the technologies and abstractions involved in the RICA emulator, which
give it high execution speed (by reducing the per-step overhead), and allow the
functionality to be extended with little effort.
The emulator is an object-orientated program written in C++, and is modular in design. Each
hardware component mentioned in section 3.2 is represented by a class (object), and they
communicate with each other via method calls. The model of the core is simply a set of instruction
cell models, each of which contains the state information that the real cell would maintain, and
a set of cell actions which capture the behaviour of that cell. The cell actions are implemented
as C++ methods (member functions). The operation of a given cell is represented by one or
more of the following cell actions:
Evaluate: Assign the output value of the cell and/or modify the internal state of the cell
according to the configuration word.
Operate: Assign the output value of the cell according to the configuration word and the values
read from its input(s).
Update: Modify the internal state of the cell according to the configuration word and values
read from the input(s).
A serialised configuration context consists of the evaluate actions (scheduled in any order),
followed by the operate actions (specifically ordered by the serialisation algorithm described
in section 3.3.2), followed by the update actions (in any order). Cells that perform only simple
combinatorial operations (which calculate an output value based on the values of their inputs)
implement only the operate action. The code sample in figure 3.6 demonstrates a simplified
version of an add cell, which is an example of a combinatorial cell.
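The three phases of a serialised configuration context can be sketched as follows. The class and
function names here (Cell, Const_cell, Reg_cell, Serialised_context, execute) are illustrative
assumptions rather than the emulator's actual classes, and the use of the evaluate action by the
constant cell is likewise an assumption. The example wires a register and an adder into an
accumulator.

```cpp
#include <cassert>
#include <vector>

// Base cell model: an output value, input connections, and three actions
// that default to doing nothing.
struct Cell {
    int output = 0;
    std::vector<const Cell*> inputs;
    virtual ~Cell() = default;
    virtual void evaluate() {}  // Output/state from configuration only.
    virtual void operate() {}   // Output from configuration and inputs.
    virtual void update() {}    // Internal state from inputs.
};

struct Const_cell : Cell {
    int value;
    explicit Const_cell(int v) : value(v) {}
    void evaluate() override { output = value; }
};

// Purely combinatorial: implements only the operate action.
struct Add_cell : Cell {
    void operate() override {
        assert(inputs.size() == 2);
        output = inputs[0]->output + inputs[1]->output;
    }
};

// Register: presents its stored value at the start of the step, and latches
// its input at the end of the step.
struct Reg_cell : Cell {
    int stored = 0;
    void evaluate() override { output = stored; }
    void update() override {
        assert(inputs.size() == 1);
        stored = inputs[0]->output;
    }
};

struct Serialised_context {
    std::vector<Cell*> evaluates;  // Any order.
    std::vector<Cell*> operates;   // Order fixed by the serialisation.
    std::vector<Cell*> updates;    // Any order.
};

// Executes one iteration of one configuration context.
void execute(const Serialised_context& context) {
    for (Cell* cell : context.evaluates) cell->evaluate();
    for (Cell* cell : context.operates) cell->operate();
    for (Cell* cell : context.updates) cell->update();
}
```

Splitting the register's behaviour between evaluate and update is what allows the adder to read
the old register value and the register to latch the new sum within the same step, without the
two actions depending on each other's order.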
The emulator parses the netlist describing the target program, then serialises the operations of
each configuration context into a sequence of equivalent cell actions. These serialised operations
are stored in an internal data model. The serialisation process is described in section 3.3.2.
Execution of the program then proceeds: these sequences of cell actions for each configuration
context encountered are executed in a large state machine by calling the appropriate virtual
function, as shown in figure 3.7.
The model of each cell contains a variable that holds the value for the cell's output port. This
can then be referenced (read) by the actions of cells that depend on that value (the input vector
passed into the operate() method). Note that the program counter can also be updated via
the cell actions (for the jump cell), and this determines which configuration context will follow.
A configuration context is the smallest unit that can be used as the target for jumps. The data
memory is modelled as a simple array wrapped by an object that provides an interface to read
and write to the memory, as described in section 3.2.
object Add_cell extends Instruction_cell 
{
properties :
- output // Storage for cell's output.
constructor :
- define cell configuration and input ports.
methods :




case ADD_ADD_SI: // Single integer.
output = in1 + in2
case ADD_SUB_V2HI: // Vector mode.






Figure 3.6: Simplified add cell class implementation pseudo-code.
// Execute steps until end condition is detected.
do 
{
step index = jumpcell program counter value 
this step = program[step index] 
for each cell action in this step 
{















while jump cell hasn't detected end
Figure 3.7: Core execution loop pseudo-code.
3.3.1 Extensibility
To tackle the goal of easy extensibility (where the least amount of effort is needed to later
add new functionality to the model), a combination of techniques was used: subclassing,
pre-processor meta-programming, templates, and scripting. These will become apparent in the
description that follows.
Certain key hardware concepts such as instruction cells and the memory interface were
generalised by decomposing them into simple interface descriptions, which were then described
using C++ abstract classes. For instance, the concept of an instruction cell is represented by
the abstract class Instruction_cell, from which all concrete cell types are derived. A
minimum set of concrete subclasses of these was created to describe the particular architecture
described in section 3.2. This set of concrete subclasses is intended to be added to by an
end-user, to model different systems.
To minimise the cost of design-space exploration, the emulator shouldn't have to be recompiled
in order to model a different core. Instead, the core should be defined by a user-provided
machine description file (MDF), which lists the types and instance counts of the instruction
cells in the core, and other information about the modelled system. To allow the geometry of
the modelled core to be defined at load-time, according to the given MDF, the ability to spawn
new instances of a particular class at load-time is needed. At load-time, the MDF is parsed, and
each cell instance defined there is instantiated in the core by requesting a new instance of the
specified cell type name.
The concept of a class factory was used to allow a new instance of a cell to be obtained by
name. This is achieved by splitting the concept of a cell into two parts: the cell instance, and
a corresponding factory. The factory is a separate class, which spawns new instances of the
corresponding cell type, and can be queried for other information about the corresponding cell
type. An instance of this factory is created for each cell type when the emulator loads. Each
cell factory derives from the Instruction_cell_factory base class. The base class is
implemented as a singleton: at most one instance is allowed to exist. This singleton instance
maintains a registry of cell types, or more specifically, records a pointer to the cell factory
instance corresponding to each cell type name. Code in the emulator's loading logic deals only
with the singleton cell factory instance, requesting from it instances of cells by name.
Each cell type requires a corresponding factory class, in order for that factory to register the
cell type with the runtime. As part of the constructor, these factories register themselves with
the base class Instruction_cell_factory. Since the functionality of each cell factory is
identical,¹ these factories are described by a C++ class template, with the template parameter
being the cell class to return.² This means that to register a cell type with the system, only a
single line of code is needed: a file-static instantiation of the factory template, with the particular
cell type given as the template parameter.
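A self-registering factory template of this kind can be sketched as follows. This is a simplified
illustration, not the emulator's implementation: the registry is held in a static function of the
base class rather than in a separate singleton instance, and the static member function
cell_type_name(), used to obtain the name under which a cell type is registered, is a
hypothetical convention introduced for the example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

class Instruction_cell {
public:
    virtual ~Instruction_cell() = default;
    virtual std::string type_name() const = 0;
};

// Base class holding the registry of factories, keyed by cell type name.
class Instruction_cell_factory {
public:
    virtual ~Instruction_cell_factory() = default;
    virtual std::unique_ptr<Instruction_cell> create() const = 0;

    // Spawns a new cell by type name; returns null for an unknown name.
    static std::unique_ptr<Instruction_cell> create_by_name(const std::string& name) {
        const auto it = registry().find(name);
        if (it == registry().end()) return nullptr;
        return it->second->create();
    }

protected:
    static void register_factory(const std::string& name,
                                 const Instruction_cell_factory* factory) {
        assert(factory != nullptr);
        registry()[name] = factory;
    }

private:
    // Function-local static: constructed on first use, so registration from
    // file-static factory objects is safe whatever their initialisation order.
    static std::map<std::string, const Instruction_cell_factory*>& registry() {
        static std::map<std::string, const Instruction_cell_factory*> instance;
        return instance;
    }
};

// One template serves every cell type; constructing an instance registers it.
template <class Cell>
class Named_instruction_cell_factory : public Instruction_cell_factory {
public:
    Named_instruction_cell_factory() { register_factory(Cell::cell_type_name(), this); }
    std::unique_ptr<Instruction_cell> create() const override {
        return std::make_unique<Cell>();
    }
};

// A concrete cell type (minimal, for illustration).
class Add_cell : public Instruction_cell {
public:
    static std::string cell_type_name() { return "Add_cell"; }
    std::string type_name() const override { return cell_type_name(); }
};

// The single line needed to register the cell type with the runtime.
static Named_instruction_cell_factory<Add_cell> Add_cell_factory;
```

Loading code can then call Instruction_cell_factory::create_by_name("Add_cell") for each
instance listed in the machine description, without knowing about any concrete cell class.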
This process is further automated by a build script, which scans a predefined directory for C++
header files representing cell types, and generates a single header file (cell-definitions.hh)
that #includes each of these.
Certain auxiliary information is needed for each cell type, e.g. to allow user-configurable
options to be passed to the cells via command-line options to the emulator, and to define which
configurations are supported.³ This information is referred to in various different places in
the emulator's main source code. To maintain readability (and thus maintainability), all of
this information is contained in the header file for each cell type. Normally, such information
would be provided as part of the cell class implementation, via the C++ virtual function
mechanism. However, much of this information is referred to in the main execution loop, which
is extremely performance critical. Therefore, querying information via virtual function calls
in such situations could reduce performance by an order of magnitude or more, considering
that the time taken to perform a virtual function call (or even a normal function call) is often
much larger than the time taken to execute a cell action. To avoid this, the information is
provided via case statements and look-up tables, which are resolved at compile time. To construct
these tables and case statements, the C pre-processor is used, a technique called pre-processor
meta-programming [72].
¹ with just the returned cell class type being different.
² i.e. the name of the corresponding concrete subclass of Instruction_cell.
³ and their names, so that these may be substituted in debug information, to improve readability.
#if defined INSTRUCTION_CELL
// Define the class name for this type of instruction cell: 
INSTRUCTION_CELL(Add_cell)
#elif defined CONFIGURATION_NAMES













// List the options available for this cell.
CELL_OPTIONS(Add_cell,
CELL_OPTION(di_mode_width, "64",
"Sets the bit-width for double integer mode.")
#else










// Cell class definition: 




Figure 3.8: Pre-processor guarded sections in a typical cell type implementation header file (with 
class implementation details omitted).
The C++ header file for a cell type is effectively split into sections. An example is shown in
figure 3.8. The header file is parsed several times during compilation: once for each section.
Each section is contained in a #ifdef <section-name> block, where the appropriate
section name macro is used. The last section is unnamed (i.e. uses #else), and contains
the cell type class definition. This is what will be seen when including the header file in the
normal manner. The other sections consist of macro expansions, which supply information that
is expanded in the appropriate places in the emulator source code. Each section may be used in
more than one place in the source code, potentially expanding into totally different code each
time. This is the main reason for using this technique: to automatically keep related fragments
of code in sync after updating, where these fragments cannot be cleanly located together in the
source files; usually a result of limitations in the language.
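The principle of expanding one description into several different pieces of code can be shown in
a single file. In the sketch below, a list macro stands in for the re-included cell header, so
that the example is self-contained; the macro names, the configuration ADD_SUB_SI, and the
functions configuration_name and find_configuration are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for the re-included header: each entry invokes the macro X once.
// ADD_SUB_SI is a hypothetical name added for the example.
#define ADD_CELL_CONFIGURATIONS(X) \
    X(ADD_ADD_SI)                  \
    X(ADD_SUB_SI)                  \
    X(ADD_SUB_V2HI)

// Expansion 1: an enumeration of configuration identifiers.
#define AS_ENUMERATOR(name) name,
enum Add_configuration { ADD_CELL_CONFIGURATIONS(AS_ENUMERATOR) ADD_CONFIGURATION_COUNT };
#undef AS_ENUMERATOR

// Expansion 2: a look-up table of human-readable names, in the same order.
#define AS_STRING(name) #name,
static const char* const add_configuration_names[] = {
    ADD_CELL_CONFIGURATIONS(AS_STRING)
};
#undef AS_STRING

// Both expansions come from the same list, so they cannot drift apart.
static_assert(sizeof(add_configuration_names) / sizeof(add_configuration_names[0])
                  == static_cast<std::size_t>(ADD_CONFIGURATION_COUNT),
              "name table and enumeration must stay in sync");

// Table look-up resolved without any virtual function call.
inline const char* configuration_name(Add_configuration configuration) {
    assert(configuration < ADD_CONFIGURATION_COUNT);
    return add_configuration_names[configuration];
}

// Reverse look-up, e.g. when parsing names from a netlist. Returns
// ADD_CONFIGURATION_COUNT if the name is unknown.
inline Add_configuration find_configuration(const char* name) {
    for (int i = 0; i < ADD_CONFIGURATION_COUNT; ++i) {
        if (std::strcmp(name, add_configuration_names[i]) == 0) {
            return static_cast<Add_configuration>(i);
        }
    }
    return ADD_CONFIGURATION_COUNT;
}
```

Adding a configuration to the list updates the enumeration, the name table, and the count
together, which is the property that the sectioned header files provide across separate source
files.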
Figures 3.9 and 3.11 show the pre-processor meta-programming templates for instantiating the
cell factories and registering the supported configuration names for each cell type, respectively.
These appear in code that is internal to the emulator, that shouldn't need to be modified when
new cell types are added. The corresponding expansions for these are shown in figures 3.10
and 3.12, respectively.
// Register the instruction cell factories, by creating singleton 
// static instances of each.
#define INSTRUCTION_CELL(instruction_cell_class) \
    static Named_instruction_cell_factory<instruction_cell_class> \
        instruction_cell_class##_factory;
#include "cell-definitions.hh" // Auto-generated file referencing
// all the cell header files.
#undef INSTRUCTION_CELL
Figure 3.9: Source code extract for auto-generating each cell type factory class, along with a file- 
static instance of it.
static Named_instruction_cell_factory<Add_cell> Add_cell_factory; 
static Named_instruction_cell_factory<Mux_cell> Mux_cell_factory;
static Named_instruction_cell_factory<Reg_cell> Reg_cell_factory;
Figure 3.10: Example auto-generated source code resulting from the pre-processor
meta-programming in figure 3.9.
The result of these features is that adding a new cell type consists simply of adding a new
header file describing the cell type, and recompiling the emulator. The resulting binary can then
be used with an updated MDF and netlist which refer to the new cell type. It can even list usage
information for the newly added cell types.
// Register the supported configurations for each cell type, by 
// implementing the appropriate member function for each cell factory. 


















#include "cell-definitions.hh" // Auto-generated file referencing
// all the cell header files.
#undef CONFIGURATION_NAME
#undef CONFIGURATION_NAMES
Figure 3.11: Source code extract for auto-generating look-up tables associating a human readable


























Figure 3.12: Example auto-generated source code resulting from the pre-processor
meta-programming in figure 3.11.
3.3.2 Contribution: Serialisation Algorithm
The serialisation algorithm is used at load-time to create the internal representation which drives
the execution state machine. This internal representation imposes a significantly lower per-step
overhead than serialising during execution.
The key requirement of this algorithm is to ensure that the result of executing the sequence
of cell actions in the execution model exactly matches the result of the original data flow
graph for that configuration context. This simply requires that a cell's action (for calculating its
output value) is scheduled before those of any dependent cells (successors). The serialisation
algorithm requires extension to deal with situations where cells maintain internal state from
one configuration context to the next. To explain this, we first give an example involving only
combinatorial cells, then a second example showing the extension required to avoid apparent
connection loops arising from internal state.
Figure 3.13: Example configuration contexts: (a) involving only combinatorial operations, (b)
including a connection loop. This case is valid since the loop involves a register, which
is a terminal cell.
The data flow graph example for a configuration context involving only combinatorial cells is
given in figure 3.13(a). The operation of any purely combinatorial cell needs only an operate
action to be defined. The constant cells supply the operands for a set of operations, and the
final result is written to storage. A human might choose a sequence such as that shown by the
numbers in figure 3.13(a). The algorithm employed by the emulator constructs the connection
hierarchy between the active cells as a directed graph. Once the hierarchy is complete, the
topological sort operation from graph theory is applied to the graph. The topological sort results
in nodes being ordered in descending order of depth in the connection hierarchy. The ordered
result is used to schedule the operate actions of each cell. Within a given depth, the cell actions
could be scheduled in any order without affecting the overall result. The direction of the arrows
in figure 3.13 indicates the direction of data flow, and defines the terminology of a predecessor
feeding data to a successor, i.e. one of the successor's input ports is connected to the output
port of the predecessor. In the completed hierarchy, a predecessor lies in some level lower than
that of any of its successors.
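The ordering requirement for purely combinatorial cells can be sketched with a standard topological sort. The text does not say which variant the emulator uses, so this sketch assumes Kahn's algorithm; the function name `serialise` and the adjacency-list representation are hypothetical.

```cpp
#include <cstddef>
#include <queue>
#include <stdexcept>
#include <vector>

// successors[i] lists the cells fed by the output of cell i.
// Returns the cell indices ordered so that every predecessor is scheduled
// before all of its successors. Throws if the graph contains a loop.
std::vector<std::size_t> serialise(
        const std::vector<std::vector<std::size_t>>& successors) {
    const std::size_t n = successors.size();

    // Number of predecessors of each cell that are not yet scheduled.
    std::vector<std::size_t> pending(n, 0);
    for (const auto& outs : successors)
        for (std::size_t s : outs) ++pending[s];

    // Cells with no predecessors (e.g. constants) can run immediately.
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t cell = ready.front();
        ready.pop();
        order.push_back(cell);
        // Scheduling this cell may release some of its successors.
        for (std::size_t s : successors[cell])
            if (--pending[s] == 0) ready.push(s);
    }

    if (order.size() != n)
        throw std::invalid_argument("connection loop in combinatorial cells");
    return order;
}
```

For two constants (cells 0 and 1) feeding an adder (cell 2) whose result is written to storage (cell 3), the successor lists `{{2}, {2}, {3}, {}}` serialise to the order 0, 1, 2, 3.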
Things are a bit more complicated than this, however, because some cells maintain internal state
information. Taking registers as an example, the output of the cell does not depend on the input
in the current configuration context; instead it depends on the internal state of the register cell.⁴
This means that it is valid for a register to appear in a connection loop, where the output of the
register is used in some sequence of operations, the result of which is stored back in the same
register. This results in a cyclic graph, making a topological sort impossible. Essentially, the
register cell can be thought of as two cells: one emitting the current value, and one receiving
the new value. However, this is not a clean approach.
Alternatively, we can introduce the concept of terminal cells, i.e. cells where the inputs do not
affect the outputs during the execution of that configuration context. Now, connection loops
are valid if one of the cells in the loop is terminal. Terminal cells provide an evaluate action,
in addition to an operate action. Calculating the output value of a terminal cell can always
be done before anything else during the execution of a configuration context,⁵ and writing to
the input(s) of a terminal cell can always be done after anything else during the execution of
a configuration context.⁶ Furthermore, some cells need to have their state modified upon each
configuration context transition (reconfiguration). This is done by providing an update action,
that is performed once the rest of the actions have been executed. So, the algorithm is extended
by scheduling all evaluate actions first, followed by the sequence of operate actions obtained
from the topological sort, and finally all update actions are scheduled. Figure 3.13(b) shows an
example, to which the algorithm would assign the sequence of cell actions given in figure 3.14.
const[0](evaluate), const[1](evaluate), reg[0](evaluate),
add[0](operate), div[0](operate), reg[0](operate), reg[10](operate),
reg[0](update), reg[10](update)
Figure 3.14: Cell action execution order for the example step DFG given in figure 3.13(b).
Registers are only a simple example of this problem. More complex examples include interfaces
to streaming memories, and cells that are internally pipelined such that their output is
delayed by (several) iterations. It has so far proven possible to map all supported cells to this
mechanism, and this approach is quite effective in minimising the number of operations that
need to be performed for each configuration context.
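The three-phase extension (evaluate, then operate, then update) can be sketched as below. This is an illustration under stated assumptions, not the emulator's implementation: the `Cell` structure, the `stateful` flag, and the rules for when a terminal cell receives an evaluate or operate action are hypothetical, and `example_cells` encodes a guessed connectivity for figure 3.13(b), which is not reproduced in the text.

```cpp
#include <cstddef>
#include <queue>
#include <stdexcept>
#include <string>
#include <vector>

struct Cell {
    std::string name;
    bool terminal;                    // output does not depend on this step's inputs
    bool stateful;                    // state is modified on each context transition
    std::vector<std::size_t> inputs;  // indices of the predecessor cells
};

std::vector<std::string> schedule_actions(const std::vector<Cell>& cells) {
    const std::size_t n = cells.size();
    std::vector<std::vector<std::size_t>> successors(n);
    std::vector<std::size_t> pending(n, 0);
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t p : cells[c].inputs) {
            successors[p].push_back(c);
            // A terminal predecessor is evaluated up front, so it imposes
            // no ordering constraint; this is what breaks register loops.
            if (!cells[p].terminal) ++pending[c];
        }
    }

    std::vector<std::string> actions;
    // Phase 1: evaluate every terminal cell whose output is used.
    for (std::size_t c = 0; c < n; ++c)
        if (cells[c].terminal && !successors[c].empty())
            actions.push_back(cells[c].name + "(evaluate)");

    // Phase 2: operate actions in topological order.
    std::queue<std::size_t> ready;
    for (std::size_t c = 0; c < n; ++c)
        if (pending[c] == 0) ready.push(c);
    std::size_t visited = 0;
    while (!ready.empty()) {
        const std::size_t c = ready.front();
        ready.pop();
        ++visited;
        if (!cells[c].terminal || !cells[c].inputs.empty())
            actions.push_back(cells[c].name + "(operate)");
        if (cells[c].terminal) continue;  // its out-edges were never counted
        for (std::size_t s : successors[c])
            if (--pending[s] == 0) ready.push(s);
    }
    if (visited != n)
        throw std::invalid_argument("connection loop without a terminal cell");

    // Phase 3: update every cell that carries state across contexts.
    for (const Cell& cell : cells)
        if (cell.stateful) actions.push_back(cell.name + "(update)");
    return actions;
}

// Assumed connectivity: reg[0] and const[0] feed add[0]; add[0] and const[1]
// feed div[0]; the result of div[0] is written to reg[0] and reg[10].
std::vector<Cell> example_cells() {
    return {
        {"const[0]", true, false, {}},     // 0
        {"const[1]", true, false, {}},     // 1
        {"reg[0]", true, true, {4}},       // 2
        {"add[0]", false, false, {2, 0}},  // 3
        {"div[0]", false, false, {3, 1}},  // 4
        {"reg[10]", true, true, {4}},      // 5
    };
}
```

Under these assumptions, `schedule_actions(example_cells())` yields the same nine actions, in the same order, as listed in figure 3.14.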
⁴ which in turn usually depends on the input to the register from a previous configuration context.
⁵ since the value does not depend on the result of any other cell during that configuration context.
⁶ since the written value does not affect any other cells during that configuration context.
3.4 Results
The performance of the emulator was compared against a SystemC transaction-level model
of the same instruction cell-based processor, and an FPGA implementation of the same array
(i.e. a dynamic reconfigurable fabric on a static reconfigurable fabric). A quad 2.2GHz AMD
Opteron PC was used as the host machine for the emulator and SystemC model. The FPGA
used was the Virtex-4 LX160.
Note that an FPGA implementation performs the same role as an HDL simulation of the processor
architecture, and is used instead of an HDL simulation since it achieves much higher
run-time performance, and so is much more suitable for the task of near real-time application
demonstration.
3.4.1 Results: Execution Speed For a Range of Standard Benchmarks
The execution speed was used as the measure of performance. The reconfigurable array is
intended to have a system clock of 500MHz. The maximum achievable clock on the FPGA
implementation of the target processor is 12MHz⁷. The ratio of these gives the performance
value for the FPGA.
For the other methods, the execution time was accurately measured and averaged over several
runs. The averaging is necessary for user-space programs, in order to reduce random error
introduced by pre-emptive context switches on the host. Execution speed is the time that the
target application should have run for on the reconfigurable array, divided by the average run
time on the model.
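The two definitions of execution speed above amount to simple ratios, sketched here. The function names are hypothetical, and any run times passed in are placeholders rather than measured data.

```cpp
#include <numeric>
#include <stdexcept>
#include <vector>

// Execution speed of a software model relative to the real array: the time
// the application should take on the array, divided by the mean measured
// run time of the model on the host.
double execution_speed(double target_run_time_s,
                       const std::vector<double>& model_run_times_s) {
    if (model_run_times_s.empty())
        throw std::invalid_argument("at least one run is required");
    const double total = std::accumulate(model_run_times_s.begin(),
                                         model_run_times_s.end(), 0.0);
    const double mean = total / static_cast<double>(model_run_times_s.size());
    return target_run_time_s / mean;
}

// Execution speed of the FPGA implementation relative to the real array:
// the ratio of the two clock frequencies.
double fpga_execution_speed(double fpga_clock_hz, double target_clock_hz) {
    return fpga_clock_hz / target_clock_hz;
}
```

With the clocks quoted above, `fpga_execution_speed(12e6, 500e6)` is 0.024, i.e. the FPGA runs at 2.4% of the intended silicon speed regardless of the application.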
The following algorithms/applications were used:
•  Discrete Cosine Transform (DCT) (for MPEG4/H.264 video).
•  Finite Impulse Response (FIR) digital filter.
•  Dhrystone (integer CPU performance benchmark).
•  MP3 (MPEG-1 layer 3) audio decoder (libmad).
•  H.264 video decoder (ffmpeg).
The benchmarks were chosen to cover the realistic extremes of control-flow intensive and
data-path intensive applications, whilst mapping to a core small enough to be implemented on the
FPGA. Dhrystone is a benchmark targeting traditional microprocessors, and aims to test their
ability to process simple integer operations with lots of control flow. The DCT and FIR programs
represent the opposite extreme: programs dominated by a single basic block with high
core utilisation, which could easily be implemented in hardware. The MP3 and H.264 examples
are real-world applications that make use of the DCT, but also have additional logic that leads
to control flow⁸. The purpose of this is to expose where the relative overheads lie between the
different software-based simulation methods. The host-native performance figures are quoted
to illustrate how well the applications match the capabilities of the host. The difference between
the host-native performance and the performance of the emulator or SystemC model can be
attributed to two factors: the overhead of the simulation technique used, and the difference in
quality of the optimising back-ends for the host-native compiler vs. that for RICA.
⁷ determined by the critical path of the synthesised instruction cell array rendered on the FPGA, which is the
same irrespective of the target application.





Benchmark   Emulator   SystemC model   FPGA   Native
FIR         1.000      3.40e-3         0.52   21
DCT         1.000      5.52e-3         1.47   61
H.264       1.000      9.44e-3         1.43   59
MP3         1.000      12.00e-3        2.43   101
Dhrystone   1.000      76.00e-3        0.83   34
Table 3.1: Execution speed for various standard benchmarks, normalised to the speed of the
emulator. The emulator is two orders of magnitude faster than the traditional SystemC
model, and nearly as fast as an FPGA implementation of the target architecture. The
overhead of emulation vs. the overhead of SystemC's events is application dependent:
the emulator is most advantaged for data-path intensive applications.
Table 3.1 shows that the performance of the emulator described in this thesis is good compared
to the other simulation methods described. The real silicon (native) is between 21 and 101
times faster than the emulator, and the FPGA model is close in speed to the emulator. Since
the FPGA is a model of the real silicon, it is a constant fraction of the speed of the real silicon.
Both software models vary in execution speed (compared to the real silicon), depending on the
application.
The relative performance of the emulator and SystemC model can also be seen to depend on the
application. Since these two models use very similar cell implementations, written in C, this
highlights the differences in the overheads incurred by the method of simulation. In addition to
performing the actual work of the cells, the SystemC kernel incurs an overhead for each event
generated by the active cells, and a further overhead at the end of each configuration context.
The emulator, on the other hand, only incurs the latter overhead, since everything except for the
path of program execution is serialised prior to execution.
The Dhrystone example consists of many short basic blocks, which results in very low core
utilisation. This represents the extreme of frequent configuration context changes with few cell
operations in between. The FIR example represents the opposite extreme, where the program
consists largely of one basic block, which results in very high core utilisation, and much core
activity between configuration context switches. The results in table 3.1 show that the emulator
is best advantaged when core utilisation is high, which supports this argument.




3.4.2 Results: Effect of Data Path Shape
To examine this variance in relative performance (execution speed) between the emulator and
the SystemC model with different programs, some small test programs were written. Each
consists of a single loop mapped into a single configuration context. In each case, the loop
body consists of a relatively simple sequence of arithmetic operations to apply to each member
of a data set. The programs differ in when the operations for a given member of the data set are





Parallel 26 16ns (5 operations) 1246x
Combinatorial 26 40ns (9 operations) 1377x
Sequential 11 16ns (5 operations) 2258x
Table 3.2: Complexity and relative execution speed (emulator vs. SystemC model) for some simple
test programs written to investigate the reason for the application-dependent relative
execution speed. Execution time for the emulator depends only on the number
and type of operations present, and not their order. The SystemC model however is
affected by data path dependencies (demonstrated by 'Parallel' vs. 'Combinatorial').
The SystemC model incurs a higher per-step/iteration overhead (shown by the degraded
relative performance in the 'Sequential' example).
Figure 3.15: Visual representation of the kernels used in table 3.2. Red lines show configuration
context boundaries, and the blue circles represent the data path that is replicated.
Time runs vertically. (a) Parallel: the four copies of the data path all run in parallel
in the same configuration context. (b) Combinatorial: two copies of the data path are
chained together, and two copies of this macro are executed in parallel in the same
configuration context. (c) Sequential: the configuration context contains only one
copy of the data path, and so has to complete four times as many iterations to process
the same amount of data.
To test the effect of the number of events generated per iteration, one program (Parallel,
figure 3.15(a)) performs the operations of four members of the data set in parallel; whilst
another program (Combinatorial, figure 3.15(b)) also operates on four members of the data
set per iteration, but a data dependency exists preventing them from running entirely in parallel
(however they still overlap to a certain extent). The number and type of operations performed
per iteration in both of these programs is the same; however the latter (Combinatorial)
case has a longer critical path. The relative performance of the emulator and SystemC model is
similar for both programs, the results of which are shown in table 3.2.
The execution time of the emulator should depend only on the operations performed, and not
the order. For the SystemC model, the longer critical path (and number of operations chained
together) causes more flutter as the combinatorial paths stabilise, resulting in more transition
events being generated. However, the execution time for each event is very small compared to
the time taken to schedule the events, and the results in fact show a slight relative gain. This
indicates that the run-time scheduling is easier when the timing of the events is more sequential.
To test the effect of the number of operations per iteration, another program was written
(Sequential, figure 3.15(c)), this time with only one member of the data set operated on
per iteration of the kernel. This requires that four times as many iterations are performed. A
significant increase in the relative speed of the emulator can be seen compared to the previous test
programs. This therefore indicates that the SystemC model incurs a disproportionately large
overhead per iteration, which supports the earlier observation with the standard benchmarks.
The source code to all three programs can be found in appendix A, along with the data flow




This chapter presented algorithms and methodologies used to implement a high-speed simulator
(emulator) for the RICA architecture. Such a simulator is implemented entirely in software. In
order to simulate a data path architecture on a conventional microprocessor, the data paths must
be broken up into an equivalent sequence of operations on the microprocessor (i.e. they must
be serialised).
Traditional simulation methodologies for data path machines were described: HDL simulators,
and their derivatives. These can be used to simulate RICA, but since RICA is intended to
be reconfigured very frequently during normal operation, simulations tend to be very slow.
This is because the serialisation is performed by the simulator upon every iteration of every
configuration context.
The approach proposed in this thesis takes advantage of knowledge about how RICA changes
state, which allows the serialisation to be performed in advance, before running the program.
This moves the overhead from run-time to load-time, and is a constant cost amortised over
the entire execution time of the application. This makes it particularly advantageous for
long-running programs, where execution time is also most significant.
The emulator operates on the set of configuration contexts that describe a given program. The
next chapters look at other aspects of the tool chain: algorithms and methodologies for creating
those configuration contexts, and maximising their performance. The emulator can be used as





The overall design premise in this work is to use C source code to program a coarse-grained
reconfigurable computing architecture. An important feature is the ability to use dynamic
reconfiguration to time division multiplex sections of a design larger than the target array. The
tool chain to do this has been partitioned into separate tools: a compiler, a scheduler, and a
routing tool.
Figure 4.1: Process of converting C source files into a set of configuration contexts for the target 
reconfigurable array, partitioned into a tool chain. The tool and files relevant to this 
section are highlighted.
The reason for this partitioning is as follows: in order to leverage the wealth of existing
compiler technology and optimisation passes, an existing compiler was used (GCC [73]). GCC is
designed to be generic; the extent of this generality adequately covers conventional computing
architectures, and others have shown how it can also be extended to less conventional
architectures [74, 75]. However, the nature of coarse-grained reconfigurable computing architectures
breaks too many of the assumptions inherent in this framework, and so is difficult to capture.
This results in poor core utilisation. So, the compiler is used to generate an intermediate
representation that a stand-alone tool can then work on, to better extract the available parallelism.




Finally, routing is done separately since the search space is very large, and so the process is
rather time consuming. Routing needs to be done in order to program the real device; however,
the behaviour and approximate timing can be analysed without having to perform routing. The
time saved by only performing routing when needed drastically improves the iteration rate of
the design cycle, which makes working with the target architecture very similar to working
with a conventional microprocessor, rather than what is common with reconfigurable hardware
(using HDL).
4.1 Problem Description
The intermediate representation taken from the compiler is in the form of a serial instruction
stream. This instruction stream is to be mapped onto a core that has the potential for large
amounts of instruction-level parallelism. Each instruction represents the operation of a particular
instruction cell in the array. The instructions are grouped into basic blocks, inside which
there is no conditional flow control between instructions. An ideal schedule for each basic block
of this instruction stream will consist of each independent data path in a given basic block
executing in parallel, such that the critical path of the schedule as a whole is that of the longest
data path. Deviations from this ideal will be necessary if insufficient resources are available in
the core or if certain timing constraints can't be met, in which case the goal is to minimise the
increase in total latency, whilst still executing all of the data paths. This allows designs (much)
larger than the core itself to be executed.
The task of scheduling is complicated by the presence of a paradox: scheduling requires the
calculation of each data path's critical path length (delay). However, this critical path length
depends on the length of interconnect used to connect each cell in the data path together. This
cannot be known until after scheduling and routing have been performed, leading to infinite
regression (c.f. which came first, the chicken or the egg?). This can be partially avoided by
making a simplification: the scheduling estimates the delay of the interconnect. This estimate
is based on an empirically obtained average interconnect length¹ multiplied by the measured
delay of a path segment and associated s-box. This average is obtained by analysing the fully
routed configuration contexts (steps) of a statistically significant set of programs mapped onto
a given array. This is then given as a property of the target array. Later, more accurate timings
can be calculated from the routed netlist, and timing-sensitive configuration values adjusted if
required (i.e. the RRC fields).
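The interconnect estimate described above is a single multiplication, which can then be folded into a data path delay. The sketch below assumes that a chain of cells pays one estimated connection between each adjacent pair of cells; that accumulation rule, the structure `Array_timing` and both function names are assumptions made for illustration only.

```cpp
#include <numeric>
#include <vector>

// Timing properties assumed to be supplied with the target array description.
struct Array_timing {
    double avg_segments_per_connection;  // empirical average interconnect length
    double segment_and_sbox_delay_ns;    // measured delay of one segment plus s-box
};

// Estimated delay of one cell-to-cell connection, before routing is known.
double estimated_connection_delay_ns(const Array_timing& timing) {
    return timing.avg_segments_per_connection * timing.segment_and_sbox_delay_ns;
}

// Estimated delay of a chain of cells: the sum of the cell delays plus one
// estimated connection between each adjacent pair of cells (an assumption).
double estimated_path_delay_ns(const std::vector<double>& cell_delays_ns,
                               const Array_timing& timing) {
    if (cell_delays_ns.empty()) return 0.0;
    const double cells = std::accumulate(cell_delays_ns.begin(),
                                         cell_delays_ns.end(), 0.0);
    const double connections =
        static_cast<double>(cell_delays_ns.size() - 1) *
        estimated_connection_delay_ns(timing);
    return cells + connections;
}
```

For example, with a hypothetical average of 3 segments per connection at 0.2ns each, a chain of three cells taking 1.0ns, 2.0ns and 1.5ns is estimated at 4.5ns of cell delay plus 1.2ns of interconnect.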
The data model was designed to be operation-centric. The interdependencies between the
operations are captured in two ways: direct via connections, and indirect via constraints. Physical
registers are represented as operations, and there are several types of register operation: input
register, output register, temporary register, and pipeline stage register. The operations do not
directly correlate to cell configurations. This is because the task of scheduling is largely about
inferring wires and registers from what appears in the assembly², and defining additional
register usage³. This operation-centric model has to be transferred to a cell-centric model for netlist
generation.
¹ in terms of number of path segments.
² which mainly uses registers as the interconnect agents,




To produce a scheduling tool that allows the assembly produced by a compiler to be efficiently
packed into configuration contexts for programming a dynamically reconfigurable array. More
specifically:
Correctness: To construct a schedule of configuration contexts that run sequentially to perform
the intended functionality of any given basic block. The schedule must adhere to
the available resources in the target core, and the resulting state change after executing
the schedule must match that implied by the assembly after having executed all the
instructions in the basic block. It must also obey a set of other architecture-specific criteria,
such as maximum representable step time (RRC field overflow).
Efficiency: The resulting schedule should consist of as few configuration contexts as possible
(to minimise program memory overhead), and the total of the context critical paths should
be as small as possible (to make it as fast as possible). This therefore involves attempting
to parallelise the data paths of the data flow graph.
Furthermore, by mapping loops to individual configuration contexts, pipelining can be applied
to dramatically improve throughput (discussed in chapter 5). Therefore, the work on scheduling
can be viewed as being for the purpose of generating basic blocks that are good candidates for
pipelining.
4.1.0.2 Objectives
•  Devise a data model that can describe a wide range of target architectures, in a manner
that allows for easy static analysis.
•  Devise a series of algorithms that operate on this data model, to transform basic blocks
into valid configuration contexts.
4.1.0.3 Novelty
List scheduling is extended to improve the ability to pack data paths into as few steps as
possible, in an algorithm called a Tree follower (section 4.9). This comprises a new layer
built on top of list scheduling, which can dynamically re-order the ready list in order to give
precedence to operations that lie on the current data path (or arm of that data path).
As a side effect of this packing, more data paths can become split across step boundaries,
requiring registers to store the values of the broken connections over each step boundary. For
cores with a very limited number of registers, this can lead to register starvation. A series of
algorithms were devised to avoid this: Register starvation avoidance (section 4.10).
Furthermore, a series of optimisation and analysis passes are presented that improve the
scheduling efficiency (Live register identification, section 4.7), and aid the routing tool to achieve a
more optimum allocation (Global live register reallocation, section 4.12), which improves




To illustrate the purpose of the various algorithms, a simple example assembly is presented here
in figure 4.2, which is to be executed on an array with the resource counts given in table 4.1.
Both the array and the basic block have been chosen to be very much smaller than what would
be typical, for the purpose of making it easier to comprehend.
The effect of the key stages of scheduling, from assembly to abstract netlist, will be
demonstrated. At the end of this process, an abstract netlist will be obtained. The example is then
taken slightly further, to illustrate how the netlist could be mapped onto the array, following
allocation & routing.
block1:
    ADD   out = r0  in1 = r1  in2 = r2  conf = 'ADD_SUB_SI
    CONST out = r3  conf = 4
    SHIFT out = r0  in1 = r0  in2 = r3  conf = 'SHIFT_SLL_SI
    CONST out = r4  conf = 1
    ADD   out = r5  in1 = r5  in2 = r4  conf = 'ADD_ADD_SI
    CONST out = r2  conf = 3
    MUL   out = r7  in1 = r2  in2 = r7  conf = 'MUL_MUL_SI
    MUL   out = r8  in1 = r8  in2 = r7  conf = 'MUL_MUL_SI
    CONST out = r4  conf = 5
    ADD   out = r6  in1 = r6  in2 = r4  conf = 'ADD_ADD_SI
    MOV   out = r4  in = r7
    MOV   out = r2  in = r8
Figure 4.2: Example assembly for a basic block. This example contains 4 independent data paths.






Table 4.1: Available instruction cell resource count for a hypothetical, artificially small RICA
array.
Performing data flow graph (DFG) analysis on the basic block, we can determine the connectivity
between the various operations. The DFG is flattened: intermediate registers are replaced
with wires. The only registers that remain in this data model are those that bring values into the
basic block (termed input registers), and those that bring values out of the basic block (termed
output registers).
The flattened DFG is shown in figure 4.3. A single basic block may consist of several
completely independent data paths, as can be seen in this example. To illustrate where they come





Figure 4.3: Data flow graph (DFG) extracted from the assembly in figure 4.2.
// Data path 1
ADD   out = r0  in1 = r1  in2 = r2  conf = 'ADD_SUB_SI
CONST out = r3  conf = 4
SHIFT out = r0  in1 = r0  in2 = r3  conf = 'SHIFT_SLL_SI
// Data path 2
CONST out = r4  conf = 1
ADD   out = r5  in1 = r5  in2 = r4  conf = 'ADD_ADD_SI
// Data path 3
CONST out = r2  conf = 3
MUL   out = r7  in1 = r2  in2 = r7  conf = 'MUL_MUL_SI
MUL   out = r8  in1 = r8  in2 = r7  conf = 'MUL_MUL_SI
// Data path 4
CONST out = r4  conf = 5
ADD   out = r6  in1 = r6  in2 = r4  conf = 'ADD_ADD_SI
// Data path 3 (continued):
MOV   out = r4  in = r7
MOV   out = r2  in = r8




Looking at just the instructions from this basic block, a first approximation to the input and
output registers can be made:
Input registers: r1, r2, r5, r6, r7, r8
Output registers: r0, r1, r2, r3, r4, r5, r6, r7, r8
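One way to arrive at this first approximation is sketched below. It assumes that an input register is any register read before it is written within the block, and that every register the block touches is conservatively treated as an output; both rules are inferred from the two lists above rather than stated in the text. The `Instruction` structure and function names are hypothetical, and `example_block` encodes one reading of the assembly in figure 4.2.

```cpp
#include <set>
#include <vector>

// One assembly instruction: the register written and the registers read.
struct Instruction {
    int out;
    std::vector<int> ins;
};

struct Register_usage {
    std::set<int> inputs;   // read before being written in this block
    std::set<int> outputs;  // every register the block touches
};

Register_usage approximate_registers(const std::vector<Instruction>& block) {
    Register_usage usage;
    std::set<int> written;
    for (const Instruction& insn : block) {
        for (int r : insn.ins) {
            // A read of a register not yet written brings a value into the block.
            if (written.count(r) == 0) usage.inputs.insert(r);
            usage.outputs.insert(r);
        }
        written.insert(insn.out);
        usage.outputs.insert(insn.out);
    }
    return usage;
}

// Assumed reading of the basic block in figure 4.2.
std::vector<Instruction> example_block() {
    return {
        {0, {1, 2}},  // ADD   r0 = r1, r2
        {3, {}},      // CONST r3 = 4
        {0, {0, 3}},  // SHIFT r0 = r0, r3
        {4, {}},      // CONST r4 = 1
        {5, {5, 4}},  // ADD   r5 = r5, r4
        {2, {}},      // CONST r2 = 3
        {7, {2, 7}},  // MUL   r7 = r2, r7
        {8, {8, 7}},  // MUL   r8 = r8, r7
        {4, {}},      // CONST r4 = 5
        {6, {6, 4}},  // ADD   r6 = r6, r4
        {4, {7}},     // MOV   r4 = r7
        {2, {8}},     // MOV   r2 = r8
    };
}
```

Applied to `example_block()`, this yields inputs r1, r2, r5, r6, r7, r8 and outputs r0 to r8, matching the lists above. Live register analysis can later shrink the output set.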
The nodes of the data flow graph are registers or operations, each corresponding to a physical
instruction cell in the array. The edges of the data flow graph represent pieces of information
(values) passed between the nodes. Table 4.2 shows the edges of the data flow graph, and the
operations that create their value.
Edge   Value represented                  Output registers
e1     input from r1                      r1
e2     input from r2                      -
e3     result of: r0 <- r1 ADD r2         -
e4     result of: r3 <- CONST 4           r3
e5     result of: r0 <- r0 SHIFT r3       r0
e6     result of: r4 <- CONST 1           -
e7     input from r5                      -
e8     result of: r5 <- r5 ADD r4         r5
e9     result of: r2 <- CONST 3           -
e10    input from r7                      -
e11    result of: r7 <- r2 MUL r7         r4, r7
e12    input from r8                      -
e13    result of: r8 <- r8 MUL r7         r2, r8
e14    result of: r4 <- CONST 5           -
e15    input from r6                      -
e16    result of: r6 <- r6 ADD r4         r6
Table 4.2: All edges from the example data flow graph in figure 4.3, and the corresponding
assembly in figure 4.2.
If the entire data flow graph can fit on the array at once, then all the edges become wires⁴.
Otherwise, the data flow graph has to be split into fragments, each of which is small enough
to fit on the array, and the fragments are executed sequentially.⁵ Any individual data paths that
are too big to fit on the array in one step must be split. The edges that are split must be replaced
by physical registers in the core, to store the temporary value, which is then read back in some
later step when the array is reconfigured with the remainder of the data path. Registers used
for this purpose are called temporary registers. The scheduling algorithm is responsible for
choosing the best places to split large data paths, and how the fragments of different data paths
are packed together into steps. Figure 4.5 shows the resulting schedule for the example basic
block.
⁴ the interconnect in the core.
⁵ time division multiplexing the array resources.
Figure 4.5: Data flow graph from figure 4.3 scheduled for the example array defined in table 4.1.
There are insufficient cells available for this basic block to become a single step, so it 
has been split. Data path 3 is spread across both steps, requiring a temporary register 
(dotted outline) to transport the value of the broken edge across the step boundary.
Across each step boundary, the scheduler must choose which registers to use for storing the
temporary results. In our example, all the registers are active in the basic block. This means
that there are no registers available purely for use as temporaries. As a result, the scheduler must
try to re-use the active registers for this purpose. Since the steps corresponding to a given basic
block are always executed in sequence, the state of the registers needs only to be preserved on
entry to the first step, and on exit from the last step. This allows the scheduler to use any of the
active registers in any way that it likes across any step boundary internal to this basic block. As
a result, the scheduler considers all values as temporaries across each internal step boundary,
as shown in figure 4.6. For each step boundary, only a single register needs to be assigned to
store each edge where the producer and any consumer of that edge lie on opposite sides of that
boundary. Therefore, duplicates only need to be stored once, e.g. e11 in the example needs only
one temporary register to bring it into the second step, despite the edge needing to be stored in
two output registers.
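The rule that each crossing edge needs exactly one temporary per boundary, however many consumers it has, can be sketched as a simple count. The `Edge` structure, the step numbering (steps counted from 0, with block exit treated as one past the last step) and the function names are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Edge {
    int producer_step;                // step in which the value is created
    std::vector<int> consumer_steps;  // steps that read it; block exit counts
                                      // as one past the last step
};

// Boundary b lies between step b and step b + 1. An edge crosses it when the
// producer is on the earlier side and at least one consumer is on the later side.
bool crosses_boundary(const Edge& edge, int boundary) {
    if (edge.producer_step > boundary) return false;
    return std::any_of(edge.consumer_steps.begin(), edge.consumer_steps.end(),
                       [boundary](int step) { return step > boundary; });
}

// Each crossing edge needs one temporary register, regardless of how many
// consumers or output registers it eventually feeds.
std::size_t temporaries_needed(const std::vector<Edge>& edges, int boundary) {
    return static_cast<std::size_t>(
        std::count_if(edges.begin(), edges.end(), [boundary](const Edge& e) {
            return crosses_boundary(e, boundary);
        }));
}
```

An edge such as e11, produced in step 0 and consumed by an operation in step 1 and by two output registers at block exit, is modelled as `Edge{0, {1, 2, 2}}` and contributes a single temporary at boundary 0.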
Registers are assigned to temporaries across each internal step boundary from the available
pool. This pool consists of all the active registers, plus any additional registers that are known
to be dead. Once all operations that read the value brought in by an input register have been
executed, that value no longer needs to be stored (it becomes dead). Similarly, the value brought
out by an output register only needs to be stored once the operation creating that result has been
executed. This leaves some of the registers free for storing other internal edges across internal




Figure 4.6: Example schedule from figure 4.5 showing all the temporary registers (shown by a 
dotted outline) that are needed to bridge values across the internal step boundary, as 
seen by the scheduling algorithm.
The size of the available pool of registers is crucial to how well the scheduler can parallelise
the code. If insufficient registers are available to store the temporary registers over any given
step boundary, then a new schedule has to be constructed (with relaxed timing), with less
parallelism. In many (if not most) situations, the active registers alone are insufficient for this
purpose. Inactive registers cannot safely be used, as they could be storing information across
this basic block for use later in the program (termed dormant registers). A simple solution to
this is to reserve a fixed set of registers entirely for the purpose of storing temporaries, termed
scratch registers.
An alternative approach is to traverse the entire control flow graph (CFG) of the program,
tracking the possible uses of each register. This information describes which inactive registers
are dormant. All other inactive registers are free to use for storing temporaries. Furthermore,
it is possible to determine which of the active registers store useful information on exit from
the basic block. All others therefore become free for use as temporaries on exit from the step
that last reads their value. Without this information, all active registers must be considered live
on exit from the basic block. This control flow graph traversal process is called live register
identification (section 4.7). It is not used in this example, since the other basic blocks in this
program are not shown.
The schedule, with registers allocated between each step, can then be written in the form of an
abstract netlist, as shown in figure 4.7, augmented with timing information etc. derived from
the per-step data flow graphs. This marks the end of the tasks performed by the scheduler. The









reg[2]; // e2(in)
add[0] { in1=reg[1].out;
  in2=reg[2].out;
  conf='ADD_SUB_SI; } // e3
const[0].conf=4; // e4
shift[0] { in1=add[0].out;
  in2=const[0].out;
  conf='SHIFT_SLL_SI; } // e5
reg[3].in=const[0].out; // e4(out)
reg[0].in=shift[0].out; // e5(out)
const[1].conf=1; // e6
reg[5].in=add[1].out; // e7(in), e8(out)
add[1] { in1=reg[5].out;
  in2=const[1].out;
  conf='ADD_ADD_SI; } // e8
const[2].conf=3; // e9
reg[7].in=mul[0].out; // e10(in), e11(out)
mul[0] { in1=const[2].out;
  in2=reg[7].out;
  conf='MUL_MUL_SI; } // e11
reg[4].in=mul[0].out; // e11(out)





reg[8].in=mul[0].out; // e12(in), e13(out)
mul[0] { in1=reg[8].out;
  in2=reg[4].out;
  conf='MUL_MUL_SI; } // e13
const[0].conf=5; // e14
reg[6].in=add[0].out; // e15(in), e16(out)
add[0] { in1=reg[6].out;
  in2=const[0].out;




// Critical path: 8.1ns
Figure 4.7: The abstract netlist resulting from the schedule shown in figure 4.5.
To get more accurate timing results, and to be able to program a physical array, the configuration
of the interconnects needs to be defined. Continuing the example to this end, we could run
the abstract netlist through the allocation and routing tool, to produce a fully qualified netlist.
This constructs the paths between the active cells according to the connectivity defined in the
abstract netlist. The various operations can be re-allocated to different instances of cells of the
same type that are closer together.
The fully qualified netlist can be directly converted into a configuration bitstream for the
physical hardware. The physical configurations would look something like those shown in figures 4.8
and 4.9. For the purposes of illustration, the allocation in these examples is the same as in the
abstract netlist.
Resource re-allocation can only be done safely for cells that have no state. Normally, registers
should be non-relocatable, because they maintain state. Since the scheduler is responsible for
allocating registers, and since it is not routing aware,6 the registers that it chooses are arbitrary.
This causes serious routability problems. However, by using the global register reallocation
information created by the scheduler, the allocator can reallocate registers, drastically
improving the situation. Allocation & routing, and bitstream generation are outside the scope
of this thesis.
Figure 4.8: The first step of basic block block1 (label block1), mapped onto the array.
6 And it would be distinctly non-trivial to make it so.









4.3 Scheduling Stages Overview
[Figure 4.10 stages, in order: DFG Analysis, CFG Analysis, Live Register Identification, Parallelisation, Pipelining, Resource Configuration, Global Register Reallocation]
Figure 4.10: The tasks performed by the scheduler—stages to convert from assembly to abstract 
netlist.
The task of converting from assembly to abstract netlist is broken into the following stages:
Linking: Multiple assemblies (from user code and system libraries) are merged together to
form the complete program, and dead code is identified and stripped away. Data symbols
are allocated to physical addresses in data memory. Section 4.4.
DFG analysis: An internal representation is created for each basic block of the program, where
the list of assembly instructions is converted into a data flow graph (DFG). Registers in
the assembly become wires, except where they bring a value into the basic block (termed
input registers), or a value out of the basic block (termed output registers). Section 4.5.
CFG analysis: The list of possible jump targets (basic blocks) is determined for each basic
block in the program. Section 4.6.
Live register identification: The information derived from the CFG is used to determine which
of the input and output registers in each basic block actually bring useful data into and
out of each basic block. This helps work around the main drawback of operating from
assembly: the assembly provides no way of explicitly marking a register as no longer
containing useful information, and thus available for storing another value. Section 4.7.
Parallelisation: Each basic block is scheduled into a series of one or more steps (configuration
contexts), by mapping the data flow graphs onto cells in the array. Constraints identified
in DFG analysis must be adhered to, and the live register information extracted from
CFG analysis is used to maximise the number of registers available for use as temporary
registers to bridge partial results across step boundaries where data paths have to be split.
This is the key area of the scheduler. A brief overview of this stage is given in section 4.8;
a full description is presented in a section of its own, section 4.9.
Pipelining: The parallelised data paths can be pipelined at this stage, although it is best
performed on a routed netlist, where the interconnect delays of each individual path are
known. That is the scope of chapter 5.
Resource configuration: The schedule is written out in the form of a netlist, which involves
some additional special-case logic to complete the configuration, such as calculating step
duration, memory access timing field generation, and memory cascading.
Global register reallocation information: After scheduling is complete, the roles of all the
registers have been decided, so it is then possible to track the flow of data through all
registers throughout all paths of possible control flow in the entire program. This
information is then written to the netlist, to allow the routing tool to freely reallocate the
registers in order to improve routing efficiency. Section 4.12.
The linking and CFG analysis stages are similar to those found in the DIABLO optimising
linker framework [76], although they were developed independently. The work here adds more
detailed live register identification, which is unnecessary with traditional microprocessors, but
very important for data path architectures. Work published on DIABLO gives an idea of the
sorts of optimisations that are possible when working at this level. Some of these optimisations
(such as basic block merging, conditional branch merging, constant propagation, and constant




In conventional C tool chains, the task of creating an executable binary is split into two stages:
compilation and linking. Compilation consists of converting C code from a compilation unit7
into assembly. This assembly is then assembled, to create a binary object file. Multiple object
files are then linked together to form the complete executable. This two-layer process has the
advantage of introducing scalability: the task of compilation requires much more processing
time and memory than linking, so being able to split an arbitrarily large program into multiple
pieces, each compiled individually (termed incremental compilation), allows the memory
requirement to be limited to that of a single assembly.
The compiler assigns globally unique symbolic names to each function, each basic block, and
each global data symbol8 in the program. It does not bind physical (static) addresses to these;
that is the task of the linker.
The differences between the target reconfigurable architecture and regular microprocessors
become apparent when deciding upon a binary object file format. The task of preparing
a program for execution on a regular microprocessor, from the assembly output of the compiler,
is simple: the assembly mnemonics directly translate into binary instruction patterns, and the
hardware is designed to execute sequences of these. However, the task of preparing a program
for execution on the target architecture (i.e. the tasks of scheduling, allocating, and routing)
is non-trivial, and not practical to perform at run-time. Furthermore, the combination of this
and the Harvard-style memory architecture, where the program memory is physically separate
from the data memory and is strictly read-only from the perspective of the target device, makes
dynamically generated code impractical. Also, dynamic linking offers little advantage in the
sort of embedded applications that are targeted. This makes it difficult to use existing object
file formats, and the features that they provide are unnecessary and/or inappropriate.
As a result, a custom binary object file format was developed that shares more in common
with the hardware world: a netlist. The concept of a netlist was extended to represent multiple
configuration contexts, and given ways to represent the initialisation values for certain areas
of memory. The scheduler tool takes in assembly files and performs the role of an optimising
linker, to create a single netlist which is then passed into the hardware-domain tools (mapper
and bitstream generation).
Figure 4.11 shows an example of the assembly format emitted by the GCC compiler RICA
back-end, with the important features highlighted. Naming conventions are used for each symbol
type, to make lexical scanning easier.
So, the conventional tasks of a linker are performed: the assembly is analysed, and the live
functions and data symbols are identified. Dead functions and data symbols are stripped. Live
data symbols are statically bound to fixed addresses in data memory, and a symbol table is
constructed to describe this mapping.
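As a rough illustration of the static binding step, the sketch below assigns consecutive aligned addresses to live data symbols and returns the resulting symbol table. The function name `bind_addresses`, the `(name, size, alignment)` tuples, and the sequential placement policy are all assumptions made for this example; the text does not specify how the tool orders or places symbols.

```python
def bind_addresses(symbols, live, base=0):
    """Bind each live data symbol to a fixed address in data memory.

    symbols: list of (name, size_in_bytes, alignment) in assembly order
    live:    set of data symbol names identified as live
    Returns the symbol table as {name: address}.
    """
    table = {}
    address = base
    for name, size, align in symbols:
        if name not in live:
            continue  # dead data symbols are stripped, so get no address
        address = (address + align - 1) // align * align  # round up to alignment
        table[name] = address
        address += size
    return table
```

Only live symbols consume address space here, which mirrors the point that dead symbols are stripped before binding.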
7 A C source file plus any headers it includes.
8 Resulting from global and static variables in C.
.section program_rom // Code.
.align 4
.global _main
.proc _main // Beginning of function '_main'.
_main:
CONST out= r12 conf= -12
ADD out= r13 in1= r1 in2= r12 conf= 'ADD_SI
WMEM in= r2 in_addr= r13 in_off= 4 conf= 'WMEM_SI
CONST out= w33 conf= @L404 // Absolute address of another block.
JUMP in_addr= w33 conf= 'JUMP_ALWAYS
L11:
// Accessing the 4th entry of a global array.
RMEM out= r3 in_addr= !_g_matrix in_off= 12 conf= 'RMEM_SI
MOV out= r23 in= r25
MOV out= r22 in= r3
L466:
RMEM out= r9 in_addr= r1 in_off= 0 conf= 'RMEM_SI
RMEM out= r2 in_addr= r1 in_off= 4 conf= 'RMEM_SI
ADD out= r1 in1= r1 in2= 32 conf= 'ADD_SI
JUMP in_addr= r9 conf= 'JUMP_ALWAYS // Return from function.
.endproc // End of function '_main'.
.section data_ram // Initialised read/write globals and static locals.
.align 4












.global _g_out // Global array initialised to zeroes.
_g_out:
.space 64
Figure 4.11: Example RICA assembly with the main function (showing a few of its basic blocks),
and some global data symbols.
Parallelisation (section 4.8) is performed on the basic blocks, to create configuration contexts
(steps), which are the fundamental unit in the program memory. Note that the step names
remain symbolic in the netlist, and are bound to physical addresses during bitstream generation.
This is because the size of each step may not be the same, e.g. if program stream compression
is used, and the resulting size cannot be determined until after mapping is complete.
4.4.1 Live Symbol Identification Algorithm
After merging the assemblies into a flat namespace, linking begins by identifying the program
entry point (boot strap). This is established by means of a naming convention: a function with
a particular name is searched for. The program entry point is initially considered to be live.
Then the following iterative process begins: the functions that were newly identified as live
are scanned. This involves looking through each instruction of each basic block within that
function. The operands of the instruction are scanned for literals of particular types (determined
by a naming convention):
Basic block labels: These are identified by the lexical scanner by the '@' or '&' prefix (for
relative and absolute addressing modes, respectively). These are assumed to represent
the direct target of the jump at the end of the current basic block, or if subsequently
stored into data memory, represent a function pointer, i.e. a potential jump target for
any indirect jumps anywhere in the program. If any such label matches the name of a
function, then that function is considered to be live.9
Data symbol labels: These are identified by the lexical scanner by the '!' prefix. This marks
that symbol as having been referenced, and therefore live. Only live data symbols are
assigned physical addresses and included in the symbol table.
The process ends when no new live functions were discovered during the last pass.
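The iterative process above can be sketched as a simple worklist loop. This is a hedged illustration, not the tool's implementation: the function name `find_live_symbols` and the input layout (each function mapped to a flat list of its operand literals) are assumptions, and the prefixes are taken from the naming convention described above.

```python
def find_live_symbols(functions, entry):
    """Identify live functions and data symbols, starting from the entry point.

    functions: function name -> list of operand literals found in its
               instructions, e.g. {"_main": ["@_func", "!_g_out"]}
    """
    live_functions = {entry}
    live_data = set()
    newly_live = [entry]
    while newly_live:  # ends when a pass discovers no new live functions
        next_pass = []
        for name in newly_live:
            for literal in functions.get(name, []):
                prefix, symbol = literal[0], literal[1:]
                if prefix in ("@", "&"):  # basic block label
                    # A label naming a function's first block marks it live.
                    if symbol in functions and symbol not in live_functions:
                        live_functions.add(symbol)
                        next_pass.append(symbol)
                elif prefix == "!":  # data symbol label
                    live_data.add(symbol)
        newly_live = next_pass
    return live_functions, live_data
```

Each function is scanned at most once, which keeps the cost low on a large flattened assembly. Labels that do not name a function are ignored, since they refer to internal basic blocks.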
This algorithm is low-cost, and the rather coarse information identified here is sufficient for
the intended purposes. Since the flattened assembly can be quite large,10 this is an important
consideration in order to maintain scalability. The later stages (DFG analysis and CFG
analysis, sections 4.5 and 4.6) refine this information, by which stage they are operating on a
much smaller amount of data (i.e. mostly only live code).
There is room for optimisation in the assignment of physical addresses to data symbols:
alignment and locality of reference can have a significant effect on the throughput of accessing that
memory. Memory access patterns can be analysed and feedback-derived optimisation can be
applied to relocate data symbols to achieve a more optimal physical address assignment.
However, this has not yet been explored in this work.
9 As the first basic block of each function is named after the function, by convention.




The compiler describes a program in terms of basic blocks, which are the fundamental unit of
control flow. The instructions within a basic block are intended to be executed in sequence,
without interruption (branching). A basic block ends either by passing control directly to the
next basic block in sequence, or by jumping to another basic block. The choice between these
can be conditional, i.e. the choice of which basic block to execute next can depend on values in
the data paths, and thus depends on the current state of the machine.
The uninterruptible nature of a basic block means that it effectively describes a fixed circuit
consisting of one or more data paths. The data flow graph (DFG) is a graphical representation
of these data paths. These circuit descriptions can be used to generate configuration contexts for
the reconfigurable core. Each instruction represents an active cell in the core, and the registers
used as operands and results in the assembly describe the connections (wires) between these
cells. Connections that have no start point or end point indicate that these values come directly
from/to physical register cells in the core.
block1:
ADD out = r0 in1 = r1 in2 = r2 conf = 'ADD_SUB_SI
CONST out = r3 conf = 4
SHIFT out = r0 in1 = r0 in2 = r3 conf = 'SHIFT_SLL_SI
CONST out = r4 conf = 1
ADD out = r5 in1 = r5 in2 = r4 conf = 'ADD_ADD_SI
CONST out = r2 conf = 3
MUL out = r7 in1 = r2 in2 = r7 conf = 'MUL_MUL_SI
MUL out = r8 in1 = r8 in2 = r7 conf = 'MUL_MUL_SI
CONST out = r4 conf = 5
ADD out = r6 in1 = r6 in2 = r4 conf = 'ADD_ADD_SI
MOV out = r4 in = r7
MOV out = r2 in = r8
Figure 4.12: Example assembly for a basic block. This example contains 4 independent data paths.
[Figure 4.13 panels: Data path 1, Data path 2, Data path 3, Data path 4]
Figure 4.13: Data flow graph (DFG) extracted from the assembly in figure 4.12.
Figure 4.12 shows an example basic block, in which analysis reveals four independent data
paths, as shown in figure 4.13. Figure 4.14 reformats the assembly to show where these data
paths come from.
// Data path 1
ADD out = r0 in1 = r1 in2 = r2 conf = 'ADD_SUB_SI
CONST out = r3 conf = 4
SHIFT out = r0 in1 = r0 in2 = r3 conf = 'SHIFT_SLL_SI
// Data path 2
CONST out = r4 conf = 1
ADD out = r5 in1 = r5 in2 = r4 conf = 'ADD_ADD_SI
// Data path 3
CONST out = r2 conf = 3
MUL out = r7 in1 = r2 in2 = r7 conf = 'MUL_MUL_SI
MUL out = r8 in1 = r8 in2 = r7 conf = 'MUL_MUL_SI
// Data path 4
CONST out = r4 conf = 5
ADD out = r6 in1 = r6 in2 = r4 conf = 'ADD_ADD_SI
// Data path 3 (continued):
MOV out = r4 in = r7
MOV out = r2 in = r8
Figure 4.14: Assembly instructions from figure 4.12 grouped by which independent data path they 
belong to.
Edge  Value represented              Output registers
e1    input from r1                  r1
e2    input from r2                  -
e3    result of: r0 <- r1 ADD r2     -
e4    result of: r3 <- CONST 4       r3
e5    result of: r0 <- r0 SHIFT r3   r0
e6    result of: r4 <- CONST 1       -
e7    input from r5                  -
e8    result of: r5 <- r5 ADD r4     r5
e9    result of: r2 <- CONST 3       -
e10   input from r7                  -
e11   result of: r7 <- r2 MUL r7     r4, r7
e12   input from r8                  -
e13   result of: r8 <- r8 MUL r7     r2, r8
e14   result of: r4 <- CONST 5       -
e15   input from r6                  -
e16   result of: r6 <- r6 ADD r4     r6
Table 4.3: All edges from the example data flow graph in figure 4.13, and the corresponding
instruction or register.
Since the basic blocks can be arbitrarily large, it may not be possible to fit all the instructions
into a single configuration context, in which case scheduling is needed to determine how to best
distribute them amongst a number of configuration contexts. This is described in sections 4.8
and 4.9. However, before this is possible, a data model is needed to represent this arbitrarily
large data flow graph (DFG). This internal representation can be thought of as the instructions
mapped to a core with infinite resources.
The internal representation of the data flow graph is defined in terms of DFG edges (or more
specifically, the concept of a hyper edge [77]). A DFG edge represents a piece of information
created by an operation, and passed as input to one or more dependent operations. Specifically,
an edge is created for the following:
input registers: An input register brings a value into the basic block from a basic block that
was executed previously (possibly the same block, for kernels). This value is represented
by an edge.
instructions: An instruction generates a new value (the result) inside the basic block. This
value is represented by an edge.
immediates: Immediates that initialise a register directly must be represented by an edge, in
case that register is an output register.
Table 4.3 shows the edges derived from the example assembly in figure 4.12. Additionally,
in the data model, the DFG edge stores a list of registers which should hold the value of this
edge upon exit from the basic block (i.e. output registers). For each edge, a list of edges that
must be calculated before it (predecessors), and a list of edges that must be calculated after it
(successors), are stored. This relationship is shown in figure 4.15.
Figure 4.15: Example data path with a particular DFG edge (e26) highlighted. Its immediate 
predecessors (e23 and e24) and successors (e27) are identified.
By abstracting away from the simple register-to-register transfer model captured by the
assembly, the inherent parallelism in the data paths becomes immediately apparent. However, certain
instructions have side-effects which require special treatment: e.g. memory access operations
effectively describe data paths (and therefore connections) that are independent of the
registers, which implies a dependency that is not captured in this data model.
To describe these additional connections/dependencies, constraints are stored as part of the data
model. A constraint describes a relationship between two DFG edges, and currently can be one
of the following:
same step or later: The DFG edge on the left-hand side of the relationship (LHS) must appear
in the same step or a step later than the step containing the DFG edge on the right-hand
side (RHS).
some step later: The DFG edge on the LHS must be scheduled in a step later in the sequence
than the step containing the DFG edge on the RHS.
The main data memory interface of the target architecture performs memory read operations
inline, i.e. the result is made available within the same step. However, memory write
operations alter the state of the memory only at the end of the step. Therefore, any potentially
aliasing memory operations11 must be placed in different steps. In this case, a some step later
constraint is placed between the read operation (LHS) and the aliasing write operation (RHS).
Similarly, the order of any sequences of writes to potentially aliasing memory locations must
be preserved. This is again done using constraints, between those memory write operations.
Different memory interfaces with other memory access behaviours can also be modelled in a
similar manner.
The resulting data model therefore consists of:
• The list of DFG edges for the basic block.
• The list of constraints involving pairs of these edges.
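The data model described in this section can be sketched as two small record types plus a check that a candidate schedule honours a constraint. This is a minimal sketch under stated assumptions: the class names `DFGEdge` and `Constraint`, the field names, and the `satisfied` helper are hypothetical, not the tool's actual data structures.

```python
from dataclasses import dataclass, field

SAME_STEP_OR_LATER = "same step or later"
SOME_STEP_LATER = "some step later"


@dataclass
class DFGEdge:
    name: str
    predecessors: list = field(default_factory=list)      # edges computed before
    successors: list = field(default_factory=list)        # edges computed after
    output_registers: list = field(default_factory=list)  # hold the value on exit


@dataclass
class Constraint:
    lhs: str   # name of the edge on the left-hand side
    kind: str  # SAME_STEP_OR_LATER or SOME_STEP_LATER
    rhs: str   # name of the edge on the right-hand side


def satisfied(constraint, step_of):
    """Check one constraint against a schedule given as {edge name: step}."""
    lhs, rhs = step_of[constraint.lhs], step_of[constraint.rhs]
    if constraint.kind == SAME_STEP_OR_LATER:
        return lhs >= rhs
    if constraint.kind == SOME_STEP_LATER:
        return lhs > rhs
    raise ValueError(f"unknown constraint kind: {constraint.kind}")
```

For the aliasing case described above, `Constraint("read", SOME_STEP_LATER, "write")` rejects a schedule that places both operations in the same step and accepts one that places the read in a later step.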
4.5.1 DFG Analysis Algorithm
Prerequisites: The list of instructions belonging to a live basic block.
Results: An internal representation of the data paths of that basic block, with the inherent
parallelism exposed.
The algorithm performs a single pass over the instructions of the basic block, in natural order.
A record is maintained of which edge represents the last value stored in each register. This
record is called the register map.
For each instruction, the operands are scanned. First, each register named in the instruction's
inputs is looked up in the register map. Any registers that do not yet appear in the register
map represent input registers, so a new edge is created to represent the value read from that
input register. For registers that do appear in the register map, the corresponding edge for that
register is marked as a predecessor of the edge that will be created for this instruction. Once
all inputs have been inspected, an edge is created to represent the output of this instruction,
and is recorded in the register map against the register named for the instruction's output in the
assembly.
11 i.e. a write followed by a read from the same address, or what could be the same address.
There is one special case where the behaviour is different: the move (MOV) instruction. A move
is virtual: there is no corresponding cell in the physical core. A move instruction provides two
capabilities that could not be expressed otherwise:
Fan-out: A move instruction with both input and output being named registers represents
a fork in the data flow graph, where the same value is propagated to more than one
successor. This is the only way to represent fan-out to multiple output registers. No edge
is created for the move instruction in this case; instead the entry in the register map is
duplicated for the new register (i.e. both entries refer to the same edge).
Immediates: A move instruction with the input being a literal represents loading an
immediate into a register. An edge is created to represent this immediate (which is treated similarly
to an input register).
The move instructions constituting the continuation of data path 3 in figure 4.14 are an example
of fan-out to output registers.
Where an edge representing an immediate is seen as a predecessor, the immediate is copied to
the instruction's input port, and the edge is no longer considered to be a predecessor.
When instructions with special side-effects are encountered, appropriate constraints are created.
Knowledge of which constraints to create is hard-coded into the tool. Records may need to be
kept for certain cases, e.g. for potentially aliasing memory accesses, a record similar to the
register map is maintained for the edges that represent memory access operations that could
alias to a particular address range.
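The single-pass algorithm can be sketched as follows. This is a simplified illustration, not the tool's code: the function name `analyse_dfg` and the `(opcode, output, inputs)` tuple format are assumptions, operands are assumed to be registers exactly when they start with `r`, input-register edges are recorded as predecessors, and neither side-effect constraints nor the copying of immediates onto input ports is modelled. The output registers are read from the final register map.

```python
def analyse_dfg(instructions):
    """Build DFG edges for one basic block in a single pass.

    instructions: list of (opcode, output_register, [input operands]).
    Returns (edges, outputs): edges maps edge name -> {"value", "preds"};
    outputs maps edge name -> registers holding it on exit from the block.
    """
    edges = {}
    register_map = {}  # register -> edge representing its latest value

    def new_edge(value, preds=()):
        name = f"e{len(edges) + 1}"
        edges[name] = {"value": value, "preds": list(preds)}
        return name

    def edge_for(register):
        # A register not yet in the map is an input register: create its edge.
        if register not in register_map:
            register_map[register] = new_edge(f"input from {register}")
        return register_map[register]

    for opcode, out, inputs in instructions:
        if opcode == "MOV" and inputs[0].startswith("r"):
            # Fan-out: no new edge, both registers refer to the same edge.
            register_map[out] = edge_for(inputs[0])
            continue
        preds = [edge_for(op) for op in inputs if op.startswith("r")]
        register_map[out] = new_edge(f"{opcode} -> {out}", preds)

    outputs = {}
    for register, edge in register_map.items():
        outputs.setdefault(edge, []).append(register)
    return edges, outputs
```

Transcribing the basic block of figure 4.12 into this tuple form gives one edge per input register and per non-move instruction, and the two trailing moves add extra output registers to existing edges rather than creating new ones.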
Once all instructions have been inspected, the final contents of the register map describe the
output registers from the basic block. As will be discussed in section 4.7, the assembly has no
direct way to state when the value stored in a register becomes unimportant. Therefore, this set
of output registers will be a superset of the registers that actually pass information into basic
blocks that will be executed later in the program. The live register identification algorithm
presented in section 4.7 is used to prune this set of output registers to a subset closer to the true
set of live output registers. Note that if any edges representing immediates remain in the register
map, then these represent initialising an output register using an immediate, so are preserved in




The program control flow graph (CFG) is a graph describing how control flow may pass
between the basic blocks of a program. The nodes are the basic blocks. The graph is directed, and
potentially cyclic, due to the presence of loops. This analysis is static: it does not execute the
program, or use statistics obtained by executing the program. Therefore, no weights are applied
to the edges of the graph, meaning that the graph does not contain information about how often
or how likely a particular path is followed.
The result of this analysis is a list of all possible jump targets (basic blocks) for each basic
block in the program. This information is of vital importance for live register identification
(section 4.7), which greatly eases the parallelisation phase (section 4.8); and for the production
of global register reallocation information (section 4.12), which can be used by the mapper tool
to reduce path lengths and congestion on the final routed configuration contexts.
The control flow graph is complicated by the following:
Recursion: Loops and other forms of recursion introduce cycles in the control flow graph.
Returning from functions: Calling a function is straightforward: the first basic block of the
function is recorded as a jump target for the basic block. However, since each function
could be called from more than one place,12 there will be more than one return point.
Furthermore, within the function, there can be more than one basic block that returns.
See figure 4.16.
Function pointers: Jumps to program addresses obtained by dereferencing a function pointer
must be identified, so that the resulting function call can be correctly taken into account.
Figure 4.16: Example program control flow graph with three functions (two user-defined, plus
bootstrap), shown by the dotted red outline. The basic blocks within each function
are shown. The program entry point is shown in bold. The function _func is called
from two places in _main, and contains two blocks that return from the function.
This results in multiple jump targets.
12 Otherwise the function should have instead been inlined.
4.6.1 CFG Analysis Algorithm
Prerequisites: Information as to which functions in the program are live (reachable) during
execution, which functions have been referred to by address in any live code (section 4.4),
and identification of the value supplying the address to the jump instruction (section 4.5).
Results: The list of all possible jump targets (basic blocks) for each basic block in the program.
The basic visitation algorithm is simple: each live function in the program is visited, in arbitrary
order. For each live function, the basic blocks belonging to the function are visited recursively
in the order that they call each other, beginning with the first basic block in the function (the
entry point). The control flow behaviour of each basic block can be one of the following:
No jump: The basic block passes control directly to the next basic block in sequence, within
the function (e.g. _main in figure 4.16). The next basic block is recorded as the sole
jump target of the current block, and the next block is visited.
Unconditional jump: The basic block passes control directly to a named basic block within
the same function. The named block is recorded as the sole jump target of the current
block, and the named block is visited.
Conditional jump: The basic block either passes control to the next basic block in sequence,
or passes control to a named basic block (e.g. _func and L3 in figure 4.16). Both targets
lie within the same function, and are marked as the only jump targets for the current
block. Both of these blocks are then visited, in turn.
Function call (direct): The basic block unconditionally passes control to the first basic block
in the named function (e.g. _reset, L1 and L2 in figure 4.16). This is identified by
the fact that the target basic block is not in the same function as the current block. The
current block is recorded as a function call, for later reference. The target block is marked
as the sole jump target of the current block. However, the called function is not visited
at this stage; the next block in sequence from the current block is visited instead, as this
will be the return point.
Function call (indirect): The basic block unconditionally passes control to an address
obtained by dereferencing a function pointer. The target block could be the first basic block
of any function whose address has been referred to in live code (candidates are provided
during linking, section 4.4). The current block is recorded as a function call, for later
reference. All blocks identified as potential function pointer targets during linking are
recorded as the jump targets of the current block. None of these are visited, and instead
the next block in sequence from the current block is visited, as this will be the return
point.
Return from function: The basic block unconditionally passes control to the address stored in
the link register (e.g. ret1, ret2 and ret3 in figure 4.16). The block can pass control
to the next block in sequence from each of the places that the function was called. These
are not known at this stage, so the block is recorded for later processing. This marks the




Since the control flow graph is typically cyclic, an end point has to be enforced to prevent
infinite recursion inside loops. This is done by keeping a record of which basic blocks have
already been visited, and returning immediately if the current basic block has already been
visited.
O nce the visitation function returns from  the first basic block in the function, the next live 
function is visited. O nce all live functions have been visited, the potential jum p targets can be 
finalised for each basic block that returns from  a function. This is done by consulting the list 
o f w hich basic blocks called that function .13 The next block in sequence from  each o f these 
callers is recorded as a jum p target o f the blocks that return from  functions.
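The finalisation step can be sketched as follows. This is a minimal Python illustration, not the thesis implementation: the function name and the three dictionaries are hypothetical, and the assumption that _reset is followed in sequence by _end is made only for the sake of the example.

```python
def finalise_return_targets(call_sites, return_blocks, next_in_sequence):
    """Give every returning block the return points of all callers.

    call_sites:       function name -> blocks recorded as calling it
    return_blocks:    function name -> blocks that return from it
    next_in_sequence: block -> block that follows it in the assembly
    (all three structures are hypothetical, for illustration only)
    """
    jump_targets = {}
    for function, returning in return_blocks.items():
        # The return point of a call is the block following the caller.
        return_points = [next_in_sequence[c] for c in call_sites.get(function, [])]
        for block in returning:
            # Safe simplification: every return block returns to every caller.
            jump_targets[block] = list(return_points)
    return jump_targets


# Block names borrowed from the example of figure 4.18; the successor of
# _reset is an assumption.
targets = finalise_return_targets(
    call_sites={"main": ["_reset"], "func": ["L1", "L2"]},
    return_blocks={"main": ["ret1"], "func": ["ret2"]},
    next_in_sequence={"_reset": "_end", "L1": "L2", "L2": "L3"},
)
print(targets)
```

With these inputs, ret2 receives both L2 and L3 as jump targets, which corresponds to func being called from two places.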
In reality, if there is more than one basic block that returns from a function, then not all of
these will necessarily be reached for each caller. But it is a safe simplification to assume that all
return points return to all callers. The side effect of this simplification is more paths of control
flow to explore during live register identification, which could lead to additional registers being
needlessly marked as live (see section 4.7).
As noted above, to deal with function pointers, a safe assumption is made: all functions whose
address is referred to in the program are considered to be jump targets for each basic block that
performs an indirect jump. An indirect jump is identified by the jump address being obtained by
dereferencing a function pointer. This assumption is unlikely to match reality, and as a result,
many registers may be incorrectly seen as being live. This is safe, but again, is potentially
inefficient.
Indirect branches can also occur to internal labels, e.g. via a long jump. These should be treated
in a similar manner to function pointers, where the address of the basic block corresponding to
that internal label will be taken in the code and stored in memory (on the stack).
As a result of the traversal, CFG analysis also yields further dead-code stripping, since any
basic block not visited during CFG traversal is, by definition, dead. Easy access to the control
flow graph makes a wealth of assembly-level optimisations possible, which are performed prior
to parallelisation. These, however, are beyond the scope of this thesis.
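The visited-record mechanism and the resulting dead-block detection can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the `successors` dictionary is hypothetical and simply lists, for each block, the blocks the traversal would visit next; the per-function visiting order described above is not modelled.

```python
def visited_blocks(successors, entry):
    """Depth-first traversal that stops when a block has already been visited."""
    visited = set()

    def visit(block):
        if block in visited:      # end point that prevents infinite recursion
            return
        visited.add(block)
        for nxt in successors.get(block, []):
            visit(nxt)

    visit(entry)
    return visited


# Hypothetical four-block program: a and b form a loop, d is never reached.
successors = {"a": ["b"], "b": ["a", "c"], "c": [], "d": ["c"]}
live_blocks = visited_blocks(successors, "a")
dead_blocks = set(successors) - live_blocks
print(sorted(live_blocks), sorted(dead_blocks))
```

Any block left out of `live_blocks` was never visited and can be stripped as dead code.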
13 or could have called that function, particularly when function pointers are involved.
Scheduling
4.7 Live Register Identification
Registers are at a premium during the parallelisation phase (section 4.9), and subsequently for
pipelining (chapter 5). Without further information, the only safe way to allocate additional
registers is to choose those from a pool of registers set aside explicitly for storing temporary values
(scratch registers). However, setting aside too many registers causes problems elsewhere, such
as making the compiler more likely to generate smaller basic blocks, which are harder to
parallelise; and setting aside too few increases the chance of register starvation, which again limits
the extent to which the basic blocks can be parallelised. This section proposes an algorithm for
improving on the estimate of which registers are really live across basic block boundaries, thus
making more registers available for other uses.
A limitation of working from the assembly is that the assembly only describes when registers
are written to (and hence become live), but not when their stored value becomes unimportant
(when they become dead). This is a problem when using the compiler to target the RICA
hardware, because the compiler uses registers to pass values between the instructions within a
basic block, and many of these values are not used outside of the basic block, which leads to
resources being needlessly tied up. There is no direct way to determine which values are only of
relevance internally to the basic block, and which are used to convey information between basic
blocks (output registers). Also, some registers that are inactive in a basic block may be storing
information for use later in the program (dormant registers). These too must be identified.
LBasicBlock1:
OP1 out=r4 in1=r2 in2=r3
OP2 out=r3 in1=r4 in2=r5
OP3 out=r5 in1=r7 in2=r9
OP4 out=r2 in1=r5 in2=r9
MOV out=r9 in=r6
(a)















r2: Input, dead, output
r3: Input, dead, output
r4: Dead, clobbered, dead
r5: Input, dead, clobbered, dead
r6: Input, live on exit
r7: Input, dead
r8: Dormant
r9: Input, clobbered, output 
r10: Reserved for scratch 
r11: Reserved for scratch 
r12: Reserved for scratch
(c)
Figure 4.17: (a) Example basic block assembly, (b) Registers read from (input) and written to 
(output) in the assembly, (c) Corresponding register lifetimes when executing the 
instructions in the assembly, line by line. The dormant registers and dead output 
registers cannot be determined by looking at just this basic block alone (and hence 
are missing from (b)).
Figure 4.17 highlights these concepts, showing how the state of the machine changes during the
execution of a simple basic block (figure 4.17(c)). Note that the machine here is what the
compiler thinks is the target architecture, which is a simple RISC-like register transfer model; not
the real reconfigurable hardware. A solid fill (cyan or grey) indicates that the register is storing
useful information, and a hatched fill indicates that the value stored in the register is no longer
needed. In this example, registers r2, r3, r5, r6, r7, and r9 are input registers (shown in green
in figure 4.17(b)), which bring data into the basic block for use in computations. During the
process of the computations, some of these values become dead (i.e. their value is not needed
thereafter), and are overwritten by new values (clobbered, shown in red in figure 4.17(b)). r2,
r3, r4, r5, and r9 are written to in the basic block, but out of these, only the final values of
r2, r3, and r9 are needed in later basic blocks (i.e. are output registers). r1 and r8 are not
involved in any operations in the basic block, but are storing information used later on (dormant).
Therefore, the registers r4, r5, r7, in addition to the reserved scratch registers r10, r11, and
r12, are available for other uses.14 In this case, this doubles the number of registers available
(i.e. 6 compared to 3). This is quite representative of the situation in general.
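The part of this classification that can be read directly from the assembly (which registers are inputs and which are written) can be sketched in Python. The tuple encoding of the instructions and the function name are hypothetical; the instruction operands are those of figure 4.17(a). Whether a written register is an output or dead on exit, and which registers are dormant, cannot be decided from one block alone, so the sketch does not attempt it.

```python
def classify_registers(instructions):
    """Return (inputs, written) for one basic block.

    instructions: list of (output_register, [input_registers]) in program
    order (a hypothetical encoding of the assembly).
    """
    inputs, written = [], []
    for out, sources in instructions:
        for reg in sources:
            # A read is a block input only if no earlier instruction in the
            # block has already written that register.
            if reg not in written and reg not in inputs:
                inputs.append(reg)
        if out not in written:
            written.append(out)
    return inputs, written


block = [
    ("r4", ["r2", "r3"]),   # OP1
    ("r3", ["r4", "r5"]),   # OP2
    ("r5", ["r7", "r9"]),   # OP3
    ("r2", ["r5", "r9"]),   # OP4
    ("r9", ["r6"]),         # MOV
]
inputs, written = classify_registers(block)
print(sorted(inputs), sorted(written))
```

For this block the sketch reports r2, r3, r5, r6, r7 and r9 as inputs and r2, r3, r4, r5 and r9 as written, in agreement with the description above.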
The compiler largely knows the register lifetime information, through a combination of the
Application Binary Interface (ABI) definition, and internal information it stores for the functions
and basic blocks as it forms them. However, due to internal implementation details, and overly
conservative assumptions used in the compiler, it proves difficult to reliably pass information
of sufficient quality into the RICA GCC back-end. This is mainly due to what information is
discarded before entering the GCC back-end, and how the compiler works on a per-function
basis, assuming that every caller saved register it modifies in that function needs to be saved
on to the stack, irrespective of whether it was live in any control paths leading to that function.
This is primarily a result of the support for incremental compilation, and the ability to call a
function from anywhere.15 An alternative approach was needed.
The key observations are as follows:
•  A value stored in a register is by definition live between when it was written and when it
is last read from.
•  A value stored in a register is by definition dead when it is overwritten (clobbered).
This applies throughout the entire execution of the program. However, since the particular
execution control flow path followed during any given run of a program is arbitrary, usually
non-deterministic, and potentially cyclic, all possible paths must be considered. Since a single
set of configuration contexts is created for each basic block, they have to apply to each possible
path of which they could be part.
As a result of parallelisation (section 4.9), values internal to a basic block that are stored in
registers in the assembly become transferred through wires in the resulting configuration contexts.
Therefore, only the state of the registers across boundaries between basic blocks needs to be
resolved.
To visualise this, it is useful to draw representations of the registers read from and written to in
each basic block, in the form shown in figure 4.17(b). When a register has only a green box,
then that indicates that the value is live on entry to that basic block, and is not overwritten.
The same value may or may not be live on exit. When a register has only a red box, then that
14 such as temporary registers, during parallelisation.
15 e.g. if the function is to be part of a library.
indicates that whatever previous value it held was clobbered, so that value is by definition dead
on entry to that basic block. The new value may or may not be live on exit. When a register has
both a green and a red box, then the previous value is live on entry, but has also been clobbered.
The new value may or may not be live on exit. By expressing all the basic blocks in this way, one
can manually follow each path of control flow, watching when a particular register is read
from or clobbered. This is essentially what the live register identification algorithm does, and
an example can be seen later in this section.
4.7.1 Contribution: Live Register Identification Algorithm
Prerequisites: All possible paths of control flow must be known. This consists of knowledge
of all possible jump targets for each basic block. This must include the effect of function
pointers, i.e. all locations which they could dereference to.
Results: The set of registers live on exit from each basic block.
The program control flow graph is a directed and potentially cyclic graph. The nodes are basic
blocks. The information is initially scattered in pieces amongst the nodes of the control flow
graph, coded in the form of which registers provide input to and which registers are overwritten






Figure 4.18: Example program control flow graph. The program entry point is shown in bold. The
program consists of two functions: main (cyan) and func (magenta), where main
calls func from two different places. Each function contains a loop.
An example control flow graph is given in figure 4.18. The prominent features of this example
control flow are shown in figure 4.19. The example consists of a bootstrap (reset) which
calls the function main. main contains a loop L1, L2, L3 (shown in figure 4.19(d)), and
then returns to the bootstrap. This loop in main calls the function func twice: once from
L1 (figure 4.19(a)), and once from L2 (figure 4.19(b)). Inside func there is another loop, L4,
shown in figure 4.19(c). This example is complex enough to demonstrate the common issues
in identifying live registers: loops, and code that is common to multiple control flow paths (i.e.
the func function being called from more than one place).





Figure 4.19: Significant features of the example CFG from figure 4.18. (a) First call of the function
func: from L1 of main, returning to L2. (b) Second call of the function func: from
L2 of main, returning to L3. (c) Inner loop: L4 in func. (d) Outer loop: L1 →
call(func) → L2 → call(func) → L3 in main.
The proposed algorithm essentially consists of visiting each node in the control flow graph,
and 'pinging' each register down the control flow graph stemming from that point, listening
for a response as to whether that register has its value used, or whether it has been clobbered.
The information gathered at that node is stored, and once complete, is passed back to the
previous node. By performing the walk, the complete picture is gradually built up, and the final
(complete) information can be seen to be returned from the program entry point.
The walk begins at the program entry point (_reset in the example). At each step of the walk,
a packet of information is constructed that will be returned to the caller (i.e. the previous step
in the walk). This packet of information is the set of registers found to be live on exit from that
basic block, according to the branch of the control flow graph stemming from that node. This
will be a subset of the registers that are live on exit from the basic block, when considering the
entire control flow graph. Once the control flow graph has been fully explored,17 the complete
set of live registers on exit from each basic block will be known. The complete trace of the
walk for the example can be found in appendix B.
Basic block   Input registers   Clobbered registers
_reset        -                 r1, r2, r9
_main         r1, r2, r9        r1, r2, r5
L1            r5                r3, r4, r9
L2            r5, r11           r3, r4, r9
L3            r5, r11           r6
ret1          r1, r9            r1, r2, r9
_func         r1, r2, r6        r1, r2, r6
L4            r3, r4, r6        r3, r6
ret2          r1, r3, r9        r1, r2, r6, r11
_end          -                 -
Table 4.4: Register information for the basic blocks of the example in figure 4.18. The information 
is read directly from the instructions of each basic block.
The instructions in each of the basic blocks in figure 4.18 yield the information given in
table 4.4. Figure 4.20(a) shows this graphically.
Each step of the walk consists of continuing the walk down each branch of the control flow
graph stemming from the current node. This is done by visiting each potential jump target for
the current basic block. The current basic block is the caller to each of these, and as a result,
receives the packets of information resulting from the walk down each of the potential jump
targets. Once these packets of information have been accumulated for the current basic block,
construction begins for the packet of information that is to be passed to the caller of the current
basic block (the LHS in the CFG edges shown in tables B.1 and B.2 in appendix B). This
packet contains a list of additional registers that could be read from as a result of going down
this branch in the CFG. This list of additional registers consists of the list of input registers to
the current basic block, and the list of registers found to be live on exit from the current basic
block (so far) minus those that are clobbered in the current basic block.
17 i.e. where each possible way of visiting each node has been considered.
Figure 4.20: The example CFG from figure 4.18 showing (a) input registers (green) and output 
registers (red) read from the assembly (as shown in table 4.4), and (b) the registers 
determined by the algorithm to be live (cyan), and the output registers determined to 
be dead (black).
Basic block   Registers live on exit         Available as temporaries
                                             Before   After
_reset        r1, r2, r6, r9                 3        10
_main         r1, r2, r5, r6                 3        10
L1            r1, r2, r3, r4, r5, r6, r9     3        7
L2            r1, r2, r3, r4, r6, r9         3        8
L3            r1, r2, r5, r6, r9             3        9
ret1          -                              3        14
_func         r1, r3, r4, r5, r6, r9         3        8
L4            r1, r3, r4, r5, r6, r9         3        8
ret2          r1, r2, r5, r6, r9, r11        3        8
_end          -                              3        14
Table 4.5: Final record of registers live on exit from each basic block in the example in figure 4.18.
The effect on the number of registers available for use as temporaries inside each block
is shown (where 'Before' can use only the scratch registers). There are 14 registers in
total in this example, of which r12, r13, and r14 are reserved as scratch.
This update logic is shown in the rightmost column in the tables, where the new return packet
is formed by taking the caller's current list of known live registers, adding the input
registers of the current block, and adding the live registers identified for the current block18 with
clobbered registers filtered out.
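The construction of the return packet amounts to a small set expression, sketched here in Python. The function name is hypothetical; the register sets are those of the _main row of tables 4.4 and 4.5.

```python
def return_packet(inputs, clobbered, live_on_exit):
    """Registers the current block reports to its caller as potentially read.

    These are the block's own input registers, plus whatever is live on exit
    from the block (so far) and is not clobbered inside the block.
    """
    return set(inputs) | (set(live_on_exit) - set(clobbered))


# _main: inputs r1, r2, r9; clobbered r1, r2, r5; live on exit r1, r2, r5, r6.
packet = return_packet(
    inputs={"r1", "r2", "r9"},
    clobbered={"r1", "r2", "r5"},
    live_on_exit={"r1", "r2", "r5", "r6"},
)
print(sorted(packet))
```

The caller adds this packet to its own record of registers live on exit. Here the packet is r1, r2, r6 and r9, which is the set recorded for _reset in table 4.5.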
To deal with cycles in the control flow graph (i.e. to prevent infinite recursion), a record is
kept for each basic block of which basic blocks it was visited from during the walk (i.e.
which basic blocks control flow passed from). If the same sequence of two basic blocks (i.e.
CFG edge) is repeated later in the walk, then control does not pass to the potential jump
targets of that basic block; instead, the packet that is to be passed to the caller is constructed
from the previously obtained information for the current basic block.
At each step of the walk, the live register information is potentially incomplete, depending
on how that basic block was visited in the previous walks. The cycle avoidance mechanism
in particular affects this. To ensure that the information is complete, the entire walk must be
repeated until no new information is obtained. This is especially important since the same
cycle19 could appear in more than one branch of the control flow graph. For example, if a
function contains a loop, that loop is a cycle that will appear in each branch of the control flow
graph where that function has been called. Each time the entire walk is repeated, the record of
which basic blocks visit each basic block is cleared. However, the record of live registers on
exit from each basic block is not cleared.
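The walk, the edge record and the repetition until nothing changes can be put together in a compact Python sketch. It follows the description above but is not the thesis implementation: the function name, the dictionary-based CFG and the three-block example are hypothetical, and a recursive formulation like this would hit Python's recursion limit on a large control flow graph.

```python
def live_on_exit(cfg, inputs, clobbered, entry):
    """Repeat the recursive CFG walk until no block gains a live register.

    cfg:       block -> list of potential jump targets
    inputs:    block -> registers read before being written in that block
    clobbered: block -> registers overwritten in that block
    """
    live = {block: set() for block in cfg}   # kept across repeated walks

    def visit(block, caller, seen_edges):
        edge = (caller, block)
        if edge not in seen_edges:           # first time along this CFG edge
            seen_edges.add(edge)
            for target in cfg[block]:
                live[block] |= visit(target, block, seen_edges)
        # Packet for the caller: inputs, plus live-on-exit minus clobbered.
        return inputs[block] | (live[block] - clobbered[block])

    while True:
        before = {block: set(regs) for block, regs in live.items()}
        visit(entry, None, set())            # edge record cleared for each walk
        if live == before:
            return live


# Hypothetical program: start falls into a self-looping block, then exits.
cfg = {"start": ["loop"], "loop": ["loop", "exit"], "exit": []}
inputs = {"start": set(), "loop": {"r1", "r2"}, "exit": {"r3"}}
clobbered = {"start": {"r1", "r2"}, "loop": {"r1"}, "exit": set()}
result = live_on_exit(cfg, inputs, clobbered, "start")
print({block: sorted(regs) for block, regs in result.items()})
```

In this small example r3 is dormant in both start and loop: neither block touches it, yet it is reported as live on exit from both because exit reads it.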
Table 4.5 shows the final live register information obtained from the CFG walk, collected
together. Figure 4.20(b) shows this graphically. It can be seen that all dormant registers have been
correctly identified, and a few registers that were written to inside basic blocks were found to
be dead on exit from those basic blocks (shown in black in the figure).
As an optimisation note, the implementation of this algorithm makes use of a recursive function.
The record of which basic blocks visit each basic block and the record of which registers are
live on exit from each basic block are stored outside of the scope of the recursive function. The
packets of information are represented by variables (sets) stored on the stack, and are passed
by reference to the recursive function. This way, the recursive function updates the value of
the variable in the previous stack frame. However, a further optimisation is possible: since the
packet of information is a subset of the final complete information for that node, and storage
already exists for that complete information for the node, a reference to that storage can be used
instead, and passed to the recursive function.
4.7.1.1 Limitations
The algorithm can falsely consider certain registers to be dormant, when in fact the information
stored there is never actually needed. Furthermore, some registers are falsely considered to be
live before function calls. This particularly affects the program entry point (as can be seen with
r6 in _reset in figure 4.20(b)). This is a further side effect of our use of assembly as the
input data model. The compiler considers all registers that it overwrites in the assembly for a
function as being live, and therefore pushes their value onto the stack in the function prologue
18 after visiting all of its jump targets.
19 i.e. the sequence of basic block names that were visited down one particular branch of the control flow tree.
(r6 is used in the func function, and thus is pushed onto the stack in the basic block _func,
which causes r6 to be seen as live all the way back to the program entry point). However, as
mentioned in previous sections, many of the registers that the compiler uses in the assembly
become wires, and therefore are not clobbered in the resulting netlist (so don't really need to
be stored). The scheduler currently does not treat stack pushes as being special in any way, and
as a result, the register whose value is pushed onto the stack is considered live on entry to the
first basic block in the function, even if it isn't subsequently overwritten in the function. This
live status passes backwards through the control flow graph all the way until that register was
last written to.20 Also, the subsequent popping of the previous value from the stack during the
function epilogue similarly results in all pushed registers being seen as live on exit from the
last basic block of the function. This can be even worse, since functions can have multiple exit
points, all of which will have those registers marked as live on exit.21
[Figure 4.21: panels (a), (b) and (c), each showing the registers for the basic blocks start, L1, L2, L3 and end.]
Figure 4.21: Example demonstrating a problem with identifying live registers in nested loops: the
algorithm incorrectly considers r6 as live in L1, L3, and start, due to the presence
of the outer loop. Where: (a) the input and output registers obtained from the
assembly, (b) the live registers identified by the algorithm, and (c) the registers that are
really live.
20 this is not the case in the example: r6 is actually used to pass information between basic blocks in the func
function, and so does get clobbered and must be stored.
21 in the example, this causes r6 to be falsely considered live in the block L2.
Since the assembly has no way to explicitly define which registers are dead on exit from a basic
block, the live register identification algorithm can falsely mark many registers as being live
when a block appears inside a loop. For example, if some registers are really live on exit from a
basic block inside the loop, but are not really live on entry to that basic block, then these are still
seen as live on entry to that basic block. This is even the case if the registers are clobbered
elsewhere in the loop body. An example of this is given in figure 4.21. There are two nested loops: L2
with r1 as the loop counter, and L1 → L2 → L3 with r2 as the loop counter. Inside the inner
loop, register r6 is used to pass a result to the next iteration of the inner loop. However, due to
the presence of the outer loop, r6 is falsely marked as live throughout L1 and L3 (and indirectly
also falsely marked as live in the block start). Such false positives do not lead to incorrect
program behaviour, but simply prevent certain registers from being used for other purposes.
All of these problems are the result of there being only one record of live registers on exit
from each basic block. Since some registers are only live when following certain CFG edges
stemming from that basic block, they propagate false positives down the other CFG edges
stemming from the same basic block. It may be possible to extend the algorithm to take this





The basic blocks that were determined to be live during CFG analysis (see section 4.6) are
each scheduled into one or more steps (configuration contexts). This transformation involves
packing the independent data paths (determined through DFG analysis, section 4.5) of that
basic block into as few steps as possible, each with as short a critical path length as possible.
This packing is achieved by means of a scheduling algorithm, which is discussed in section 4.9.
The second aspect of this task (that of assigning registers to bridge fragments of data paths
across the resulting step boundaries) is discussed here.
The resulting sequence of steps should have the same set of input registers and (live) output
registers as the original basic block. However, any state changes of registers internal to the
basic block need not be honoured. In other words, only the state of registers that are used to
transfer values between basic blocks needs to be preserved.
If it is not possible to pack all the instructions from a given basic block into a single step
(configuration context), then it will be necessary to infer new registers (termed temporary registers)
to store the values of any data paths that span the boundary between steps.
In addition to this, registers must be assigned on each step boundary to store temporary values
across that boundary. Such values correspond to edges in the data flow graph that straddle
a boundary, when a data path cannot be packed entirely into a single step.
4.8.1 Temporary Register Assignment
A temporary register is assigned to each value that needs to be stored across each step boundary.
Even the final values written to output registers in the assembly are stored in temporary registers.
The values are only transferred to the output registers in the last step. Similarly, values brought
into the basic block via input registers are transferred into temporary registers in the first step.
This means that all registers active in the assembly for that basic block are available for use
as temporaries across each internal step boundary, and all values that exist across each step
boundary are treated equally. This has the advantage of simplifying the problem, and also
increases the number of registers available internally, as will be discussed shortly.
Registers are assigned to store the values represented by edges in the data flow graph, which
were identified during DFG analysis (section 4.5). Exactly one register is needed to store each
edge that is live over a step boundary. Fan-out (if necessary) is accomplished in the subsequent
steps, where multiple operations read from the same register. The calculation of which registers
are available for use as temporaries is done for each step. The following logic applies:
Input registers: Each input register is represented by an edge in the DFG, and these have to
be bridged across step boundaries until the last step where that value is read. The value is
transferred into a temporary register on exit from the first step. Note that the value may
be live all the way through the basic block, in which case it will be transferred to one or
more output registers.
Output registers: These can be used as temporaries in all steps before the final value is written
to them. For conceptual simplicity, output registers are only considered to be written to in
the last step; a temporary register is used to bring their final value through all the steps
since the one where it was calculated.
To prevent needless register to register copying, temporary register allocation gives some edges
precedence: temporary registers representing values read from an input register are kept in
the same register as the input register, and temporary registers representing values that will be
written to output registers are kept in one of the output registers that will receive that value. For
all other temporary values, the same register is assigned from one step to the next so long as it
is still available.
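The rule that exactly one register is needed per value that is live over a step boundary, regardless of fan-out, can be made concrete with a small Python sketch. The function name and the `step_of` and `consumers` dictionaries are hypothetical, and the three-node data flow graph is a toy example, not the one in figure 4.22.

```python
def values_across_boundaries(step_of, consumers):
    """Count the values that must be held in a register over each step boundary.

    step_of:   DFG node -> index of the step it is scheduled in
    consumers: DFG node -> nodes that read the value it produces
    Returns one count per internal boundary (between step i and step i + 1).
    """
    n_steps = max(step_of.values()) + 1
    counts = []
    for boundary in range(n_steps - 1):
        # A value crosses the boundary if it is produced at or before it and
        # read after it. Several later readers still need only one register.
        crossing = {
            producer
            for producer, readers in consumers.items()
            if step_of[producer] <= boundary
            and any(step_of[r] > boundary for r in readers)
        }
        counts.append(len(crossing))
    return counts


step_of = {"a": 0, "b": 1, "c": 2}
consumers = {"a": ["b", "c"], "b": ["c"], "c": []}
counts = values_across_boundaries(step_of, consumers)
print(counts)
```

The value of a is read in both later steps but is counted once per boundary, which is the fan-out behaviour described above.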
Figure 4.22: Example basic block (loop) showing: (a) the assembly, (b) the data paths extracted 
from the assembly, and (c) the data paths scheduled into steps, showing the edges that 
need to be stored (in temporary registers) over each step boundary.
Figure 4.22(a) shows an example basic block (L1), and figure 4.22(b) the data flow graph
extracted from that assembly. The example is for a 20 term multiply accumulate (MAC), which
the compiler has chosen to partly roll into 4 MACs per iteration. The basic block forms a loop,
where the final result is obtained in r5.22 The compiler introduces r3 as the accumulator, which
it initialises to zero before entering the loop. r6 is the loop counter, which is incremented (by
4) on each iteration, and is used as the base address for each memory read (RMEM) operation.
To reduce clutter in the diagram, immediates are used for some values (indicated by numbers
on the nodes) to feed in constant values. r4 supplies the multiplier operand, which is invariant
over the iterations of the loop (but may change between successive times the loop is entered).
22 although this will contain intermediate results until the last iteration.
Its value is duplicated into another register (r8), for arbitrary reasons. Since r4 is read from in
each iteration, its value must be preserved throughout; therefore it is a live output register. The
edges are individually numbered, in the order in which the corresponding
instructions appeared in the assembly.
Figure 4.22(c) shows a possible resulting schedule of the same basic block, when resources are
limited to 2 ADD cells and 2 MUL cells. This results in 3 steps, where the total critical path
has been minimised by distributing the chain of adders across the steps, which minimises the
critical path in each step. Also note that the memory read operations are all performed in the
first step, since storing their result in registers across the step boundary doesn't increase the
critical path of the first step, but reduces the critical path of the second step.
The downside of this schedule is an increase in the number of edges that are split across step
boundaries. Since the assembly only refers to r2, r3, r4, r5, r6 and r8, these are the only
registers available as temporaries (total 6). However, in this example there are 3 registers and 4
broken data paths needing to be stored over the first internal step boundary, requiring a total of 7
temporary registers. This would require using a scratch register.23 However, one of the registers
being stored is a duplicate of another. By working in terms of edges, and deferring output
registers until the last step, registers are freed for each duplicate present. In this case, there is
one duplicate (r8), so this frees up a register over both internal step boundaries, resulting in
there being sufficient registers available as temporaries, without having to rely on scratch.
Note that in general, although this logic reduces the register count, the temporary register count
often exceeds the number of active registers in the assembly. This requires the availability of
scratch registers, or knowledge about which inactive registers are dead in that block (obtained
through live register identification, section 4.7). Similarly, knowledge of which (if any) output
registers are dead on exit from the basic block can be used to avoid storing the value across
internal step boundaries when not needed.




In order to  achieve the sm allest total critical path, all independent data  paths in the data flow 
graph should  be executed in parallel. In cases w here constrain ts (such as m em ory access pa t­
terns) m ake it im possib le to perform  all in parallel, additional configuration contexts are needed, 
and the data paths should be p laced in as early  a context as the constrain ts allow.
This represen ts the ideal case for execution speed. In m any sim ple cases, particularly  w ith 
large cores (w ith abundant resources), this can be achieved. However, in general, this m ay not 
be possib le for tw o reasons:
•  There may be insufficient instruction cell resources available in the core to perform all the
operations in each configuration context defined by the constraints alone. This requires
the insertion of additional step boundaries (configuration contexts).
•  Even if sufficient instruction cell resources are available, data paths split across step
boundaries imposed by the constraints will require additional registers (referred to as
temporary registers) to store the value of each broken edge across the step boundary. If
too many edges are split in this manner, there may be insufficient registers available to
store all the values. If this happens, the solution is to insert additional step boundaries,
forcing smaller broken independent data paths to appear in later steps, instead of the
normal policy of as early as possible.
The first issue (resource starvation) is effectively dealt with by means of a ready list. The
second issue (register starvation) is more complex, and several strategies have been developed
to resolve it (described in section 4.10 on page 95).
4.9.1 Background: List Scheduling
The algorithm is based on list scheduling. List scheduling is an iterative scheduling heuristic
that performs a non-exhaustive search of the solution space. The goal is to pack the given
instructions into a series of execution slots, where this packing has the minimum execution
time on the target CPU (or some other metric). Successive execution slots are filled by taking
entries from the beginning of the ready list.
The ready list is a subset of the instructions that are yet to be scheduled, and contains only those
instructions that have no unresolved dependencies. Each entry is checked for constraints being
satisfied, and if so, the entry is added to the current slot in the schedule, and removed from
the ready list. The ready list is re-evaluated/re-populated each time an instruction is scheduled,
since some instructions may now have their dependencies resolved. The algorithm then continues
with the next slot. The algorithm finishes once all instructions have been scheduled. The
flow chart is given in figure 4.23.
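The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `list_schedule`, the `deps` mapping and the single `slots_per_step` capacity are assumptions made for the example, and dependents are placed in a later slot than their predecessors (i.e. no operation chaining).

```python
def list_schedule(deps, slots_per_step, priority=None):
    """Generic list scheduling sketch.

    deps maps each instruction to the set of instructions it depends on.
    slots_per_step is a hypothetical capacity limit per execution slot.
    """
    order = list(deps)                      # original program order
    priority = priority or order.index      # default sort: program order
    done, schedule = set(), []
    while len(done) < len(deps):
        # Ready list: unscheduled instructions whose dependencies have all
        # been scheduled in earlier slots.
        ready = sorted((i for i in order if i not in done and deps[i] <= done),
                       key=priority)
        if not ready:
            raise ValueError("cyclic dependencies")
        step = ready[:slots_per_step]       # fill the slot from the list head
        schedule.append(step)
        done.update(step)                   # ready list is rebuilt next pass
    return schedule


example = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
packed = list_schedule(example, slots_per_step=2)
```

Passing a different `priority` function changes only the order of the ready list, which is where list scheduling variants differ.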
List scheduling algorithms can be applied to multiple issue architectures (such as VLIWs), by
flattening the issue slots for each time unit into a single-dimensional array of slots [32]. The
logic for calculating whether dependencies and constraints have been met is altered accordingly.
This allows the algorithm to do things like populate branch slots [78]: slots after a
branch/jump instruction, which allow certain operations to be performed within the internal
latency of the jump. Similarly, the dependency and constraint logic can be modified to take
account of pipelined instructions, and can be an effective way of reducing pipeline bubbles
[79].
Figure 4.23: Flow chart of generic list scheduling algorithm.
Since the ready list is always inspected in order, its order is very important. List scheduling
algorithms differ primarily in how they define this order. The most common sort orders are
related to mobility, i.e. the distance that the entry can be moved whilst satisfying its
dependencies. Mobility can also be thought of as a measure of idle time or slack. Operations that
lie on the critical path have a mobility of zero. Previous work on scheduling on the RICA
architecture involved applying mobility-based list scheduling [57].
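As an illustration of mobility as slack, the sketch below computes the earliest and latest start time of each operation in a small dependency graph and takes the difference. The function name, the `deps` and `delay` dictionaries and the example delays are assumptions made for this example only.

```python
def mobility(deps, delay):
    """Return {op: slack} for a DAG whose keys are in topological order.

    deps[op] lists the operations that op depends on; delay[op] is its
    (hypothetical) execution delay.
    """
    ops = list(deps)
    asap = {}
    for op in ops:                           # earliest start: after all predecessors
        asap[op] = max((asap[p] + delay[p] for p in deps[op]), default=0)
    total = max(asap[op] + delay[op] for op in ops)   # critical path length
    succs = {op: [s for s in ops if op in deps[s]] for op in ops}
    alap = {}
    for op in reversed(ops):                 # latest start that keeps `total`
        alap[op] = min((alap[s] for s in succs[op]), default=total) - delay[op]
    return {op: alap[op] - asap[op] for op in ops}


# A three-operation chain (a -> b -> c) and one independent operation x.
slack = mobility({"a": [], "b": ["a"], "c": ["b"], "x": []},
                 {"a": 2, "b": 2, "c": 2, "x": 2})
```

In this example the chain forms the critical path, so its operations have zero mobility, while `x` can slide by the length of the rest of the chain.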
Figure 4.24: Dependent and independent operations. Operations can either be (a) independent of 
each other, where any inputs and outputs are registers or constants, or (b) dependent 
on the results of one or more other operations. The latter is called operation chaining. 
The critical path is shown in bold in each case.
That previous work applied list scheduling to architectures that support operation chaining,
i.e. combinatorial data paths. The concept of operation chaining is shown in figure 4.24, where
the same operations are connected together in a different manner in the two cases presented24.
When the operations are independent of one another, these may be executed in parallel, if the
hardware supports this. Dependent operations must be serialised on conventional architectures,
but can be executed in the same cycle on hardware that supports operation chaining. Operation
chaining extends the critical path of the configuration context, which reduces throughput.
However, this can be compensated for by pipelining (see chapter 5).
Because the delays of the various functional units and interconnect differ, it isn't accurate to
express the configurations in terms of slots of discrete time. A key feature of that work was
to extend the concept introduced with multiple-issue architectures, where multiple slots exist
for the same unit of time. The execution slot dimension is extended such that a slot is used to
represent each available instruction cell resource in the physical array. Then the time dimension
is modified such that rather than representing execution clock cycles, each slot represents
an entire configuration that should be loaded and executed on the core. Each new time slot
represents a new configuration that should be loaded and executed in sequence.
With this strategy, even though the instructions within a configuration look like they are
executed in parallel (since they appear in the same 'time' slot), the instructions can be independent
(executed in parallel) or dependent (executed in sequence, combinatorially) on other instructions
in the same configuration. The original data flow graph is used to determine the connectivity
between the instructions within a configuration. The scheduling algorithm need only be
aware of the connectivity through the effect it has on the critical path.
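The effect of chaining on the critical path can be illustrated with a short sketch. It assumes a simplified delay model (one fixed delay per operation, no interconnect delay, no reconfiguration time); the function names and example values are hypothetical. Operations chained within a step add their delays along the dependency chain, and the total critical path of a schedule is the sum over its steps.

```python
def step_critical_path(step, deps, delay):
    """Longest combinatorial delay through the operations in one step.

    step lists operations in dependency order; only dependencies on
    operations in the same step contribute to the chain.
    """
    finish = {}
    for op in step:
        start = max((finish[p] for p in deps[op] if p in finish), default=0)
        finish[op] = start + delay[op]
    return max(finish.values(), default=0)


def total_critical_path(steps, deps, delay):
    """Sum of the critical paths of each configuration in the sequence."""
    return sum(step_critical_path(s, deps, delay) for s in steps)


deps = {"a": [], "b": ["a"], "c": []}
delay = {"a": 2, "b": 3, "c": 4}
one_step = total_critical_path([["a", "b", "c"]], deps, delay)
two_steps = total_critical_path([["a", "c"], ["b"]], deps, delay)
```

Here chaining `a` and `b` in one step gives a shorter total than splitting them, because the split schedule pays for the slower independent operation `c` in the first step.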
There were two mobility-based orderings considered: as soon as possible (ASAP), and as late
as possible (ALAP). These refer to where in the ready list instructions with the highest mobility
should appear. Instructions with the same mobility appear in their original (purely sequential)
order. ALAP has the effect of giving precedence to instructions that lie on the critical path,
but often leads to more sequential behaviour (i.e. lower core utilisation, and more steps). ASAP
has the effect of giving precedence to parallel arms of the data flow graph, but often leads to
the critical path of the entire DFG not lying on the critical path of each step produced (i.e.
additional idle time latency is incurred). Figures 4.25 and 4.26 compare these two approaches
using some simple examples.
Clearly, the mobility-based list scheduling approach has its weaknesses. This stems from the
fact that the mobility sort order of the ready list imposes a fixed order of visitation on instructions
belonging to different data paths, or to different parallel arms of the same data path. The
problem arises when instructions from more than one of these are able to be inserted, which leads
to contention. The order in which these arms are added can lead to very different schedules,
and in the worst case (i.e. if one of these lies on the critical path) the wrong choice can result
in a schedule with a sub-optimal total critical path (i.e. the sum of the critical paths of each
configuration in the sequence).
24 The examples are NOT intended to produce the same result.
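[Figure 4.25 graphic not reproduced: a data flow graph with data paths dp 1 and dp 2, and the two
resulting schedules, labelled "Total critical path: 16ns + 1x reconfiguration time" and
"Total critical path: 14ns + 1x reconfiguration time".]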
Figure 4.25: Comparison of as soon as possible (ASAP) and as late as possible (ALAP) mobility 
based list scheduling techniques, with an arbitrary data flow graph. There is con­
tention for the add cell resource (there are only two instances of the cell in this exam­
ple, but three ADD operations in the data flow graph). Numbers indicate the order in 
which the operations were scheduled. In this example, the ALAP scheme prevails.
The mobility sort order enforces a fixed precedence, which can easily cause the wrong arm to
be visited (and scheduled) first. Furthermore, when combined with other dependencies later
in the data paths, such contention can lead to bubbles (poor cell utilisation) and therefore an
excessive number of configurations being generated. This is bad for performance (throughput),
and increases the program memory requirements.
[Figure 4.26 graphic not reproduced: a second data flow graph and the two resulting schedules,
labelled "Total critical path: 12ns + 1x reconfiguration time" and
"Total critical path: 14ns + 1x reconfiguration time".]
Figure 4.26: Comparison of as soon as possible (ASAP) and as late as possible (ALAP) mobility 
based list scheduling techniques, with another arbitrary data flow graph. There is 
contention for the add cell resource (there are only two instances of the cell in this 
example, but three ADD operations in the data flow graph). Numbers indicate the 
order in which the operations were scheduled. In this example, the ASAP scheme 
prevails.
The key to minimising the total critical path and/or configuration context count is to give
precedence to the instructions that lie on the critical path in the original data flow graph. These
are easy to identify. However, because they are by definition dependent on one another, even if
they appear in the ready list before any instructions from other paths, the instructions from the
other paths may get scheduled before those on the critical path, due to them becoming ready
sooner. Therefore, there is no obvious way to directly encode a fixed visitation order to augment
the mobility sort order in the ready list.
4.9.2 Contribution: Tree Follower
The work presented in this section provides a solution to the problem described at the end
of section 4.9.1, of how to describe a sort order for the ready list such that the scheduling of
instructions not on the critical path does not block instructions on the critical path. In general, no
such sort order can be described. The proposed solution is to layer an algorithm on top, to alter
the ready list according to the current situation.
Figure 4.27: Flow chart of the proposed tree follower scheduling algorithm. Note that scheduling 
is done in reverse order.
The aim of this new layer is to give precedence to the instructions that lie on the current data
path (or arm of a data path), only switching to another data path under one of the following
conditions:
•  once all the instructions on the current data path have been scheduled.
•  if no more can be scheduled due to resource starvation (or other constraints).
•  if scheduling the current instruction would cause an imbalance compared to other
outstanding edges in the ready list.
This reduces the chance of blocking, and should reduce the number of connections split over
step boundaries for each data path, thus reducing the register requirement. The algorithm is
shown in figure 4.27. Note that no check for register starvation is performed when
scheduling each edge, as it is likely that starvation may be subsequently avoided by scheduling
other edges further up the same data path, depending on the shape of that data path. Instead, a




The concept of 'balance' is based on relative position (output delay) in the entire data flow
graph. Scheduling is postponed for instructions on the current data path if the difference in
output delay between the current instruction and any of the outstanding ones in the ready list
exceeds a pre-determined threshold. This avoids needlessly extending the total critical path of
all configuration contexts added together.
Essentially, the data paths are added as entries in a second ready list, which is sorted in order
of critical path. The data path with the longest critical path is tried first. Whenever a new
configuration context (step) is created, the search begins again with the longest remaining data
path. Within a data path, whenever there is a choice between arms to descend, the
one with the longest critical path is always considered first25.
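The two ingredients just described, the second-level ordering of data paths and the balance test, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, critical paths and output delays are given as plain numbers, and the balance test uses the absolute difference in output delay, which the text does not specify.

```python
def order_data_paths(critical_path):
    """Second-level ready list: data paths sorted longest critical path first.

    critical_path maps a data path name to its critical path length.
    """
    return sorted(critical_path, key=critical_path.get, reverse=True)


def is_balanced(current_delay, outstanding_delays, threshold):
    """True if the current instruction may be scheduled without imbalance.

    Scheduling is postponed when the output delay of the current instruction
    differs from that of any outstanding ready-list entry by more than
    `threshold` (absolute difference assumed here).
    """
    return all(abs(current_delay - d) <= threshold for d in outstanding_delays)


visit_order = order_data_paths({"dp1": 6, "dp2": 10, "dp3": 8})
```

The threshold value is a tuning parameter; the thesis only states that it is pre-determined.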
[Figure 4.28 graphic not reproduced: panels (a) and (b) each show a data flow graph with data paths
dp 1 and dp 2 and the schedule produced by tree follower scheduling. Recoverable labels: (a) DFG
critical path 10ns, step 1 critical path 4ns, total critical path 11ns (remainder of label lost);
(b) step 1 critical path 4ns, step 2 critical path 7ns, total critical path 11ns + 1x
reconfiguration time.]
Figure 4.28: The tree follower scheduling algorithm applied to the same examples as before: (a) 
figure 4.25, and (b) figure 4.26; where only 2 instances of the add resource are avail­
able. The resulting schedule is more efficient in both cases.
It should be noted that in order to maintain correct program behaviour, the branch/jump instruction
(JUMP) must be placed in the last step of a basic block. Since the scheduling algorithm
is essentially open-ended,26 it is difficult to ensure this via constraints. One solution would be
to delay the insertion of the JUMP instruction until an otherwise complete schedule has been
created. However, this would lead to a large distance (number of step boundaries) between the
JUMP instruction and its predecessors, which consumes registers. An alternative is to simply
schedule in reverse, so that the first step created is the last step to be executed in the program
sequence. Instructions that have no successors (such as the JUMP instruction) will therefore be
scheduled first, thus placing the JUMP in the first step generated. Furthermore, the normal operation of
the algorithm would ensure that its predecessors are placed nearby, thus reducing (and in most
cases eliminating) the need for registers to store their value across step boundaries.
25 appears earlier in the lower-level ready list.
26 it keeps creating new steps until there are no instructions left.
The output of the scheduling algorithm is a data model describing the steps that were generated.
Each step lists the DFG edges (see section 4.5) that were scheduled in that step, along with a list
of DFG edges whose value must be brought into that step from the next step in sequence.27 A
temporary register is assigned to each of these edges brought in from the next step. Figure 4.29
shows an example of this output data model.
Step 1: {
Temporary registers bringing values into this step (from previous step):
None
Edges in this step:
(edge list lost in extraction)
}
(step header lost in extraction)
Temporary registers bringing values into this step (from previous step):
r6(e1), r8(e3), r2(e6), r3(e8), r4(e10), r5(e13)
(edge list lost in extraction)
(step header lost in extraction)
Temporary registers bringing values into this step (from previous step):
r6(e1), r8(e3), r5(e12), r4(e14)
Edges in this step:
e15, e16, e17, e18
}
Figure 4.29: The step data model produced by the scheduling algorithm for the example in fig­
ure 4.22 (on page 85), with an arbitrary assignment of temporary registers to the 
DFG edges being bridged across each step boundary.
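A step data model of this kind could be represented roughly as below. The class and field names (`Step`, `edges`, `temporaries`) are hypothetical and are not taken from the thesis tools; the example values are those of the last step listed in figure 4.29.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One configuration context in the schedule (illustrative sketch)."""
    edges: list                                       # DFG edges scheduled in this step
    temporaries: dict = field(default_factory=dict)   # register -> edge brought in

    def registers_needed(self):
        """Number of temporary registers bridging values into this step."""
        return len(self.temporaries)


last_listed = Step(
    edges=["e15", "e16", "e17", "e18"],
    temporaries={"r6": "e1", "r8": "e3", "r5": "e12", "r4": "e14"},
)
```

Keying `temporaries` by register name makes it impossible to assign one register to two edges in the same step, which is one of the properties a valid schedule must have.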
When implementing the algorithm, it is possible to optimise the search order by immediately
visiting the predecessors of a freshly scheduled instruction, in descending order of start delay.
The effect of this algorithm is essentially to place the longest data path first (partitioned into
steps), then pack the next longest data path around it, and so on. This leads to a very tight
packing, and thus a high core utilisation, high throughput, and reduced program memory
requirement.
However, despite minimising the number of connections split per data path, visiting each data
path often leads to multiple data paths having connections split across step boundaries. Overall,
this leads to a higher chance of register starvation. Section 4.10 discusses ways to resolve this.
27 i.e. since the scheduling is performed in reverse, results are seen to propagate from later steps to earlier steps.
4.10 Register Starvation Avoidance
As discussed in section 4.5, the internal representation of the data flow graph (DFG) of a basic
block exposes the parallelism inherent in the data paths described by the instructions. The task
of parallelisation (section 4.8) uses this information to determine an optimal packing of these
independent data paths so that as many as possible run concurrently. If the complexity exceeds
the resources available in the target architecture, then a partial serialisation is chosen using
the algorithm described in section 4.9. The scheduler assigns a register to store the value of
each broken data path across each step boundary (referred to as temporary registers). If there
are insufficient registers available for this purpose, then scheduling fails. This section looks at
techniques for avoiding failure.
The particular serialisation chosen by the compiler (expressed in the assembly) ensures that
the number of data paths that are broken at any moment in 'time' (from one instruction to the
next), each requiring a register to store their value, never exceeds the number of registers
in the target architecture. If it is unable to ensure this, the compiler uses the system data
memory28 to store the excess broken data paths at each moment in time. However, by doing
so, the resulting memory access operations require that certain operations are executed in a
particular order (i.e. remain serialised), and step boundaries are needed between each state
change29. This significantly reduces the extent to which parallelisation is possible, and thus the
throughput is dramatically affected30.
Therefore, a trade-off has to be made to ensure that the compiler has enough registers available
to avoid using the stack, and yet the scheduler must have enough registers set aside (as scratch)
for use as temporary registers. Note that the live register identification algorithm described in
section 4.7 goes a long way to circumventing this trade-off, by making available any unused
registers that were originally under the control of the compiler. Nevertheless, if the basic block
is particularly complex (i.e. requiring many times as many computation resources as there are
in the core) there may simply not be enough registers available to allow the optimum partial
serialisation to be used. A work-around is needed to allow an alternative partial serialisation to be
chosen: one that requires fewer registers. This process is called register starvation avoidance.
Four methods of register starvation avoidance were devised. In descending order of desirability,
these are as follows:
Rewind: This essentially pulls out fragments of split data paths from the current step, and
forces them to be placed in later steps; possibly at the sacrifice of increasing the overall
critical path length, if this data path is part of the critical path. Section 4.10.1.
Shuffle: Discards the previous scheduling result and tries to pack the data paths in a random
order, rather than in descending order of critical path length. This is repeated a few times,
if necessary. Section 4.10.2.
28 more specifically, the stack.
29 to allow the intermediate results to pass into and out of the external memory.
30 particularly when taking into account the memory bandwidth.
Basic block splitting: Splits the basic block in half (or close to half, if 'wires' become split),
and schedules each fragment separately. This improves the chance of achieving a valid
schedule, as it reduces the number of independent data paths available for scheduling at
once. Section 4.10.3.
Serialisation: This is similar to basic block splitting, but even more pervasive. Uses the fact
that the serialisation chosen by the compiler is itself a valid (but extremely inefficient)
schedule. Works through the instructions in the order that they appear in the assembly,
using the registers named in the instructions. Some of these registers become wires,
if the instruction can be packed into the same step as the instruction that preceded it.
Section 4.10.4.
Whenever register starvation is encountered, the scheduler tries these in the order listed above.
The first strategies are less likely to compromise the performance of the resulting code, but are
also less likely to be able to resolve the conflict. The need for flat scheduling has yet to be
encountered. As a result, flat scheduling has not been implemented. The sections that follow
describe these techniques in more detail.
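The ordered fallback described above amounts to trying each strategy in turn until one yields a valid schedule. The sketch below is illustrative only: `schedule_with_avoidance` and the convention that a strategy returns `None` on register starvation are assumptions, and the strategies themselves are stand-in callables rather than the real techniques.

```python
def schedule_with_avoidance(block, strategies):
    """Try each (name, strategy) pair in order of desirability.

    A strategy is any callable taking the basic block and returning a
    schedule, or None if register starvation could not be avoided.
    """
    for name, strategy in strategies:
        schedule = strategy(block)
        if schedule is not None:
            return name, schedule
    raise RuntimeError("register starvation could not be resolved")


# Stand-in strategies: rewind fails, shuffle succeeds with a one-step schedule.
strategies = [
    ("rewind", lambda block: None),
    ("shuffle", lambda block: [list(block)]),
    ("split", lambda block: [[i] for i in block]),
]
chosen, result = schedule_with_avoidance(["i1", "i2"], strategies)
```

Because the list is ordered by desirability, a cheaper strategy that succeeds always shadows the more disruptive ones after it.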
4.10.1 Rewind
If register starvation occurs in the current step during scheduling, this approach goes back to
the last valid state (without starvation), and creates a step boundary there. This usually has the
effect of pulling the last few (incomplete) independent data paths from the current step, and
placing them into a new step. Data paths incompletely scheduled in a step require temporary
registers to bridge the step boundary. Therefore, by reducing the number of split data paths,
this has the effect of decreasing the number of temporary registers needed. However, this is
potentially at the sacrifice of increasing the overall critical path length, if the data path(s) in
question are part of the critical path.
The example in figure 4.30(a) shows the data flow graph of a basic block containing a single data
path. The example assumes that the core has 4 cells (add cells) that support the ADD operation.
As a result, the entire data path cannot be mapped to a single step. It also assumes that there
are no scratch registers available to store temporaries, so just the input registers (5 in total) are
available. The input register r4 is live on exit. The tree follower scheduling algorithm (described
in section 4.9.2) would begin by scheduling e19, since it has the longest combinatorial start
delay, and then would descend down the branch containing e8, scheduling into the current step
all the DFG edges in that branch, except those representing input registers, i.e. edges e1, e3,
e4, e5, e6 and e8 would be scheduled.
Rewind points are saved each time there are at most five broken edges that would need to be
stored. For instance, a rewind point would be saved after having scheduled edges e19, e8, e6,
e5, e4, e3 and e1, since only edges e2, e7, e9, and e18 (4 in total) would need to be stored. It
would then continue to the next predecessor of e19, i.e. e18, which again would be scheduled
in the current step, but no rewind point would be saved, since e2, e7, e9, e16 and e17 (6 in









Figure 4.30: (a) Example basic block DFG consisting of one data path. The core has insufficient
ADD cells to schedule the whole block into a single step. (b) Last rewind point (before
scheduling e16), no register starvation. (c) The resulting schedule if e16 was scheduled
in the current step, requiring six temporary registers, leading to starvation (missing
register shown in red). (d) The resulting schedule after rewind, i.e. with e16 delayed
until the next step, leading to a valid schedule.
Scheduling e16 in the current step exhausts the available add cells, making it impossible for
e11 or e15 to be scheduled in the current step. The scheduling algorithm would continue to
schedule e13 and e12, resulting in register starvation, as shown in figure 4.30(c). If the rewind
register starvation avoidance scheme were then invoked, the situation shown in figure 4.30(b)
would be restored, and a new step boundary created. Scheduling would then continue in the
next step, and a valid schedule would be achieved, as shown in figure 4.30(d).
This approach has the limitation that although it may be possible to make the current step valid
(i.e. free from starvation), there is no account taken of how this will affect subsequent steps.
This is the normal mechanism of failure for this approach: rewinding allows a data path to be
partially contained in the current step, but the registers incurred to store the broken edges often
lead to starvation in later steps; whereas if the data path had been completely postponed until a
later step, starvation would have been avoided.
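The checkpointing behind rewind can be sketched as follows. This is a simplified model, not the thesis implementation: the function name is hypothetical, and the number of broken edges after each scheduled edge is supplied as a precomputed list rather than derived from a data flow graph.

```python
def fill_step_with_rewind(visit_order, broken_after, available):
    """Return the edges kept in the current step after an optional rewind.

    visit_order   edges in the order the scheduler placed them in the step
    broken_after  broken_after[i] is the number of broken edges needing a
                  temporary register once visit_order[:i + 1] is scheduled
    available     number of registers available as temporaries
    """
    rewind_point = 0
    for i, needed in enumerate(broken_after):
        if needed <= available:
            rewind_point = i + 1             # valid state: save a rewind point
    if broken_after and broken_after[-1] > available:
        return list(visit_order[:rewind_point])   # starvation: rewind
    return list(visit_order)                 # step is valid as it stands


kept = fill_step_with_rewind(["a", "b", "c", "d"], [2, 4, 6, 7], available=5)
no_rewind_point = fill_step_with_rewind(["a", "b"], [6, 7], available=5)
```

An empty result, as in the second call, corresponds to the case where rewinding would unwind the whole step, which is when the next strategy (shuffle) is needed.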
4.10.2 Shuffle
This is based on a random re-ordering of the root nodes. The previous schedule is discarded,
and scheduling starts over from the beginning again. This time the independent data paths are
considered in a random order, instead of in descending order of critical path length. This has the
effect of increasing the chance of smaller data paths being scheduled first. Smaller data paths
are less likely to require splitting, and thus are likely to consume fewer registers. At the same
time, this has the effect of causing fewer of the longer data paths to appear in parallel
(i.e. serialises them). This potentially increases the total critical path length of the resulting
schedule (all steps), which compromises throughput. Rewind is always used in addition to this
technique, to prevent too many of the smaller data paths from being packed into earlier steps.
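A retry loop of this kind might look as follows. The names `schedule_with_shuffle` and `try_schedule`, the default of three attempts, and the fixed seed are assumptions for the sketch; the thesis only states that the shuffle is repeated a few times if necessary.

```python
import random


def schedule_with_shuffle(roots, try_schedule, attempts=3, seed=0):
    """Re-run scheduling with the root data paths in a random order.

    try_schedule is a callable taking the ordered roots and returning a
    schedule, or None if register starvation still occurs.
    """
    rng = random.Random(seed)                # seeded for reproducible builds
    order = list(roots)
    for _ in range(attempts):
        rng.shuffle(order)                   # discard the old order entirely
        schedule = try_schedule(list(order))
        if schedule is not None:
            return schedule
    return None                              # fall through to the next strategy


roots = ["dp1", "dp2", "dp3"]
accepted = schedule_with_shuffle(roots, lambda order: order)
rejected = schedule_with_shuffle(roots, lambda order: None)
```

Seeding the generator is a design choice for the sketch: it keeps the compiler output deterministic between runs, at the cost of always exploring the same sequence of orders.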
Figure 4.31(a) shows an example basic block consisting of three independent data paths. The
core is limited to one mul cell (MUL instruction), one add cell (ADD instruction), and two
const cells (CONST instruction). There are no scratch registers available for storing additional
temporaries. As a result, only the input registers (r2, r3, r4, and r6, a total of four) are
available for storing temporaries. The normal root node order would cause edges from data
path 2 to be visited first (since the critical path is longest), then data path 3, then data path
1. So, scheduling would begin with e12 (which has the longest start delay), then proceed up
that data path to schedule e11, e10, and e9. It can't schedule e8, since the mul resource has
already been used. This exhausts data path 2 for this step, so it moves on to e19 (in data path
3), which is of the next longest critical path. e19, e18, and e17 can all be scheduled, but e16
can't due to starvation of the add resource. e3 (of data path 1) is visited next, but cannot be
scheduled, for the same reason. No more edges can be scheduled, so a new step is begun. There
are 4 temporaries needing to be stored (e3, e8, e10, and e16), which is within the availability. The
situation is shown in figure 4.31(b).
Scheduling of the next step begins by going back to data path 2, so schedules e8, e6, e7, and e4.
This exhausts data path 2, so data path 3 is visited. e16 and e15 are scheduled, but e13 can't due
to starvation of the const resource. Data path 1 is then visited, but e3 can't be scheduled due
to starvation of the add resource. No more edges can be scheduled in the current step, so a step
boundary is added. Doing so here requires temporary registers to store e3, e5, e10, e13, and e14









Figure 4.31: Example basic block DFG demonstrating the need for shuffling of the data path 
search order as a means to avoid register starvation, (a) The data paths and DFG 
edges, (b) The first step created using natural (depth first) search order (yet to be 
scheduled edges are shown with a dotted outline). Register starvation has not yet oc­
curred. (c) The second step created, leading to register starvation (missing register 
shown in red), with no valid rewind point, (d) A valid schedule resulting from visiting 
the data paths in a shuffled order.
there is no rewind point to go back to, since too many temporaries were needed following the
insertion of each edge in this second step, i.e. rewinding would completely unwind that step,
leading to exactly the same choices to be made on the next attempt.
This failure can be avoided by discarding that schedule, and beginning again with a shuffled
root edge list. This has the effect of visiting the data paths in a different order. The result is
shown in figure 4.31(d). Here, data path 3 is visited first, causing it to be scheduled completely
in the first step. Data path 2 is then visited next, which can be scheduled up to e11, with e9
having to be in the next step due to starvation of the add resource. Similarly, e3 (of data path
1) cannot be scheduled. A step boundary is created, requiring 4 edges to be stored in temporary
registers, which is within the availability. Scheduling of the next step begins with e9 of data
path 2, since data path 3 has been exhausted. The remainder of data path 2 can be scheduled
without problem. e3 of data path 1 cannot be scheduled, due to starvation of the add resource,
so another step boundary is created. This requires 4 temporaries, which again is within the
availability. The remaining edges (e1 and e3) can be scheduled, and the input registers placed.
4.10.3 Basic Block Splitting
In this approach, the schedule and data flow graph are discarded, and the basic block is modified
at the instruction level to split it into two smaller pieces. The purpose of this is to reduce the
number of independent data paths that need to be scheduled together. It also leads to large
data paths being broken into smaller pieces. A split point is chosen near to the middle of the
basic block31. The split point may have to be adjusted in order to avoid breaking any wires32,




Figure 4.32: Example of basic block splitting, (a) Original basic block assembly, (b) Original 
data flow graph, (c) Assembly after splitting into two blocks, (d) Resulting data flow 
graphs of the two new basic blocks, (e) Assembly after another split, (f) Resulting 
data flow graphs of the three new basic blocks.
31 i.e. where roughly the same number of instructions appear in each new fragment.
32 connections that require both end points to be in the same step.
A simple example is shown in figure 4.32. The basic blocks created by the split are then treated
separately, as if they had appeared in the original program. This process of splitting can happen
any number of times, making the blocks increasingly smaller, until in the limit the behaviour
approaches that of the serialisation approach, described in section 4.10.4.
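Choosing a split point near the middle while keeping wires intact can be sketched as below. The representation is an assumption made for the example: instructions are identified by their index in the block, and each wire is a hypothetical (producer index, consumer index) pair whose two ends must stay in the same fragment.

```python
def choose_split_point(n_instructions, wires):
    """Pick the split index closest to the middle that cuts no wire.

    A split at k separates instructions [0, k) from [k, n_instructions).
    Returns None if every possible split would break a wire.
    """
    def cuts(k):
        # A wire is broken if its producer lands before the split and its
        # consumer lands at or after it.
        return any(p < k <= c for p, c in wires)

    middle = n_instructions // 2
    candidates = sorted(range(1, n_instructions), key=lambda k: abs(k - middle))
    for k in candidates:
        if not cuts(k):
            return k
    return None


adjusted = choose_split_point(8, wires=[(3, 4)])
unconstrained = choose_split_point(8, wires=[])
```

Applying the function again to each fragment gives the repeated splitting described above.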
4.10.4 Serialisation
This approach involves creating a new schedule by working through the instructions in the order
that they appear in the assembly, using the registers named in the instructions. If the instruction
can be packed into the same step as the instruction that preceded it, then the register named for
that connection can become a wire. This approach makes direct use of the fact that the compiler
has already come up with a valid assignment of 'temporary' registers for the whole basic block,
using the pool of registers under its control. A step boundary is placed under the following
conditions:
•  There are insufficient resources to place the current instruction into the current step.
•  The constraints require a step boundary at this point.
•  The register that is to store the result of the current instruction is already going to be
written to by an instruction in the current step.
The last is an interesting point: after scheduling more instructions, some of these will become
wires again, in which case this instruction doesn't conflict33, but this is not known until after
scheduling more instructions.
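A minimal sketch of this in-order packing is given below. It models only the first and third conditions (cell availability and a destination register already written in the current step); the constraint-driven boundaries and the conversion of registers into wires are omitted. The instruction format, a (cell type, destination register) pair, and all names are assumptions for the example.

```python
def serialise(instructions, cells_available):
    """Pack instructions into steps in assembly order.

    instructions     list of (cell_type, dest_register) pairs
    cells_available  number of instances of each cell type in the core
                     (assumed to be at least one per type used)
    """
    steps, current, used, written = [], [], {}, set()
    for cell, dest in instructions:
        no_cell = used.get(cell, 0) >= cells_available.get(cell, 0)
        clash = dest in written              # register already written this step
        if current and (no_cell or clash):
            steps.append(current)            # place a step boundary
            current, used, written = [], {}, set()
        current.append((cell, dest))
        used[cell] = used.get(cell, 0) + 1
        written.add(dest)
    if current:
        steps.append(current)
    return steps


program = [("add", "r1"), ("add", "r2"), ("add", "r3"), ("mul", "r1")]
packed_steps = serialise(program, {"add": 2, "mul": 1})
```

Since the instruction order and register names are taken unchanged from the assembly, the result needs no additional temporary registers, which is why this approach cannot itself suffer register starvation.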




The final stage of creating an abstract netlist for the target architecture is to convert the internal
data model into a form that is compatible with the netlist syntax. The netlist syntax describes
the configuration contexts (steps) in terms of active instruction cells, their configuration, and
their connectivity. The internal data models chosen for the tasks of parallelisation (section 4.8)
and pipelining (chapter 5) operate on data flow graph (DFG) edges (section 4.5). These edges
are with reference to the data flow graph of the original basic block. The scheduling algorithm
will have packed these DFG edges into one or more steps. So, the data flow graphs of these
individual steps must then be reconstructed, and the connectivity extracted. This section shows
how this can be done.
A DFG edge can either represent reading from  a register, reading from  the output o f  an in ­
struction cell, o r reading an im m ediate value that should be stored in a register. Furtherm ore, 
the data model for the parallélisation phase describes registers that have been assigned to store 
tem porary values across step boundaries, and pipelining describes registers that have been as­
signed as pipeline stage registers. A ll o f these m ust be taken into account when reconstructing 
the individual step data flow graphs.
F igure 4.33 continues the parallélisation exam ple from  figure 4.22. F igure 4.33(a) shows the 
interna] representation obtained during parallélisation. This corresponds to figure 4.22(c), after 
tem porary registers have been assigned on each step boundary. N ote that since the schedule in 
constructed in reverse, the tem porary registers are seen to be bringing values o f stored edges 
into the step. Tem porary registers w ere assigned first from  the group o f active registers in 
the basic block (r3 , r4 , r5 , r 6, r 8), and then from  a pool o f scratch registers (rlO , r l l ,  r l 2 ,  
r l3 ) .  F igure 4.33(b) shows the individual step data flow graphs that should be derived from  the 
internal representation, and figure 4.33(c) shows the sam e inform ation expressed in the RICA 
netlist syntax. Tem porary register assignm ent has m inim ised the num ber o f register-to-register 
copies needed, so some registers storing tem porary values across the step boundaries do not 
get w ritten to, since they continue to store the sam e value over each boundary, w here possible 
(which is in all cases in this exam ple).
The process can also choose to re-allocate certain cell instances— i.e. choose to swap certain 
instances o f the sam e type— w here the order matters. The only situation encountered thus far 





[Figure 4.33 graphic, partly recoverable from the extracted text.
(a) Internal data model:
First step (step name not recoverable): temporary registers bringing edges into step: N/A; edges in this step: e1, e2, e3, e4, e5, e6, e7, e8, e10, e13.
Step L1_step2: temporary registers bringing edges into step: r6(e1), r4(e3), r8(e6), r5(e8), r3(e10), r10(e13); edges in this step: e9, e11, e12, e14.
Step L1_step3: entries not recoverable.
(b) Individual step data flow graphs: not recoverable.
(c) RICA netlist for the three steps (the second and third named L1_step2 and L1_step3): not recoverable.]
Figure 4.33: Example from figure 4.22 (on page 85) converted to steps, (a) the internal data model 
from parallelisation, (b) the individual step data flow graphs, (c) the netlist generated.
4.11.1 Background: RMEM Cascading
Each RMEM (read data memory) cell (cell name rmem, instruction RMEM) samples its address 
input a programmable number of clock cycles after the beginning of the step, to allow the 
address value to settle. The memory fetch then begins, returning the value at the output of the 
cell, within the same step. Due to the presence of arbitration logic, conflicting memory accesses 
impose an additional (dynamic) delay, as the contending accesses are queued (serialised). The 
step duration counter is frozen until the queues are empty, to allow the lifetime of the step to 
be extended so that all outputs settle before the next step begins.34
Dependencies between different memory reads within the same step are expressed via the cell’s 
configuration, and this requires special treatment. This section describes the considerations 
involved.
[Figure 4.34 graphic: step data flow graphs (a) and (b), each annotated with the configuration word fields for rmem[0] and rmem[1] (start delay and cascade bit) and the step time.]
Figure 4.34: Step data flow graphs showing memory read operation cascading, (a) example of one 
data path containing two independent memory read operations, (b) example of one 
data path containing a chain of two dependent memory read operations. The section 
of the configuration word corresponding to the memory access cells and RRC step 
time, is shown for each, with the cascade bit highlighted.
Figure 4.34(a) shows an example step data flow graph, involving two memory accesses. When 
these access non-conflicting addresses at run-time, then they can happen in parallel. This is 
shown in figure 4.35(a), where the memory fetches (shown by the black arrows) overlap in 
time. If there is a conflict (i.e. they are both accessing the same memory bank), then the 
arbitration logic serialises them, as shown in figure 4.35(b).







Figure 4.35: Timing diagram (horizontal axis representing time) for the step DFG shown in 
figure 4.34. RRC clock cycles are shown by dotted lines. Boxes show cell combinatorial 
delay, grey boxes show interconnect delay, gaps represent idle time. Events are shown 
in thick dotted lines, (a) Timing for figure 4.34(a) when no memory access conflict 
exists, (b) timing for figure 4.34(a) when a memory access conflict exists, resulting in 
additional dynamic delay (due to being queued), (c) Timing for figure 4.34(b) (no 
conflict possible).
Before pipelining was made possible by the work outlined in chapter 5, dependent memory 
operations within a single-step kernel were only possible via the addition of hardware support. 
This is referred to as cascading. This allows multiple instances of the rmem cell to be involved 
in the same data path, and be dependent upon each other35, by ensuring that the address read 
delay of the cascaded cell is extended by any dynamic delay introduced by queuing during the 
preceding memory read operation.
Figure 4.34(b) illustrates cascading. Cascading is described in the configuration word by a 
single bit for each rmem cell instance (the cascade bit, shown in grey in the figure), which 
indicates whether or not the cell is cascaded to the cell with the preceding instance index; 
rmem[0] has no cascade bit. A start delay of 0 means one clock cycle from the reference 
point, which is either the beginning of the step (if not cascaded to the preceding cell), or when 
the preceding instance returned its value (if cascaded). The step time field does not include 
memory latency, as the counter is paused whilst the fetch is happening. This can be seen in 
figure 4.34, as the bit patterns differ only by the cascade bit, yet the actual timing of the events
35i.e. where the read address of one is affected by the value returned by another.
is different in the two cases presented: i.e. the total fetch time and execution time is longer 
in (b), but the combinatorial critical paths of both are similar. Also, both cases share the same 
start time pattern for rmem[1], despite the fact that in (b) the fetch will begin much later (due 
to cascading).
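The effect of the cascade bit on when each address is sampled can be illustrated with a small timing model. This is a sketch under simplifying assumptions that are not taken from the thesis: every fetch is given the same fixed `latency` in clock cycles, no arbitration (dynamic) delay is modelled, and the function name and tuple layout are invented for the example. The reference point rule (a start delay of 0 means one clock cycle after the reference point) follows the description above.

```python
def rmem_sample_times(cells, latency):
    """Return (sample_cycle, result_cycle) for each rmem instance.

    cells   -- list of (start_delay, cascade) in instance-index order
    latency -- assumed fixed fetch latency in clock cycles (illustrative only)
    """
    times = []
    prev_result = None
    for index, (start_delay, cascade) in enumerate(cells):
        if cascade and index == 0:
            raise ValueError("rmem[0] has no cascade bit")
        # Reference point: start of the step, or the cycle at which the
        # preceding instance returned its value when cascaded.
        reference = prev_result if cascade else 0
        sample = reference + start_delay + 1
        result = sample + latency
        times.append((sample, result))
        prev_result = result
    return times
```

With identical start delay fields, enabling the cascade bit on rmem[1] moves its fetch to after rmem[0] has returned, which mirrors the observation that the two bit patterns differ only in the cascade bit while the event timing differs.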
Note that with the introduction of support for internally pipelined cells (section 5.7), data 
memory can be accessed much more efficiently by pipelining the memory read operations such 
that the address is given in one step/iteration, and the result is obtained at the beginning of a 
step/iteration some number of iterations later. This allows the latency of the memory access to 
be hidden by pipelining the step around the internal pipeline geometry of the memory access 
operation, thus preventing the iteration interval from being limited by the memory access 
latency. This can be seen in figure 4.36, which shows the timing diagrams corresponding to those 
in figure 4.35, but with internally pipelined cells. The total number of cycles per iteration is 
less in all cases. Note that further improvements in throughput can be achieved by pipelining 
the step around the internal pipeline geometry of the cells, eliminating the memory idle time.
[Figure 4.36 graphic: timing diagrams (a), (b) and (c), with events labelled step begin, sample address, memory latency, result ready, step boundary and step end.]
Figure 4.36: Timing diagram (horizontal axis representing time) for the step DFG shown in 
figure 4.34, with internally pipelined memory access cells, (a) Timing for figure 4.34(a) 
when no memory access conflict exists, (b) timing for figure 4.34(a) when a memory 
access conflict exists, resulting in additional dynamic delay (due to being queued), (c) 
Timing for figure 4.34(b) (no conflict possible).
4.11.2 Contribution: RMEM Cascading Algorithm
An algorithm is needed to determine when RMEM cascading is required, and in what order the 
rmem cells must be allocated. This section proposes one such algorithm. In order for cascading 
of RMEMs to work, the RMEM operations in a dependency chain (all within the same step) must 
be allocated to rmem cells contiguously in ascending order of index, starting with the one at the 
beginning of the chain.
The approach is to analyse the data paths in each step that involve RMEM operations, by 
removing all other operations from the data paths. These RMEM-only data paths can then be easily 
inspected to determine which RMEM operations depend on which others within the same data 
path. The order of independent RMEM operations within a data path is irrelevant, but dependent 
operations must be contiguous and in the correct order.
The RMEM operations are first re-ordered so that each RMEM-only data path consists of 
contiguous rmem cell instance indexes. This is done by visiting the RMEM-only data paths in turn, 
in arbitrary order. Then, within each RMEM-only data path, the operations are re-ordered in 
ascending order of start delay. This means that there will be no RMEM data paths interleaved, 
and all edges within each RMEM data path will be in propagation order. This satisfies the 
requirements stated above.
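The re-ordering described above can be sketched as follows. The input representation is an assumption made for illustration: each RMEM-only data path is given as a list of `(operation_name, start_delay)` pairs, and the function returns the new rmem instance index for every operation. The paths are visited in the order supplied, which stands in for the arbitrary visiting order.

```python
def reorder_rmems(paths):
    """Assign contiguous rmem instance indexes to each RMEM-only data path.

    paths -- list of data paths; each path is a list of (name, start_delay)
    Returns a dict mapping operation name to its new rmem instance index.
    """
    allocation = {}
    next_index = 0
    for path in paths:  # visiting order of the paths is arbitrary
        # Ascending start delay places operations in propagation order;
        # sorted() is stable, so equal delays keep their input order.
        for name, _delay in sorted(path, key=lambda op: op[1]):
            allocation[name] = next_index
            next_index += 1
    return allocation
```

For two interleaved data paths, for example `[("a", 0), ("c", 2)]` and `[("b", 0), ("d", 3)]`, the result places `a` and `c` on rmem[0] and rmem[1], and `b` and `d` on rmem[2] and rmem[3], so neither path is interleaved with the other.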
Figure 4.37: Analysis of RMEM operations, (a) Original step data paths, with RMEMs shown in 
bold, (b) Step data paths reduced to only the relationship between RMEM operations, 
(c) RMEM-only data paths with cell instances re-ordered for cascading.
Figure 4.37(a) gives an example, showing the data flow graph for a step containing two data 
paths, each of which has RMEM operations. This is in the form of the data model following 
parallelisation and resource allocation. As a result, the nodes represent physical cell instances. 
Figure 4.37(b) shows the same data paths after having removed all other operations, leaving 
only the RMEMs. The relationship between the RMEMs (i.e. dependencies) can be determined
directly by looking at the immediate predecessors. However, at this point, the relationship is not 
yet compatible with cascading, since the data paths do not yet contain contiguously assigned 
rmem cell instances. Figure 4.37(c) shows the same RMEM-only data paths with the cell 
instances re-ordered, which is now compatible with cascading.
It should be pointed out that the instruction cell resources of a given type are allocated in the 
same order as the DFG edge names, and the edge names are ordered the same as the instructions 
in the assembly. The nature of the assembly is such that it is impossible for an instruction to 
appear in the assembly before another instruction that depends on it. This means that RMEMs 
will already be ordered according to where they appear in the data path that they are part of. 
However, it is still possible for an independent RMEM instruction to have been inserted into the 
assembly between two dependent RMEM instructions, which is why the re-ordering is required. 
The order shown in figure 4.37(b) is indicative of this interleaving of data paths in the assembly.
With the RMEM operations re-ordered (as shown in figure 4.37(c)), it is then possible to 
determine which cascade bits to enable. If an operation is dependent on another in the same 
RMEM-only data path, then the cascade bit is enabled. In situations with more than two RMEM 
operations in the same data path, the cascade bit may also need to be set. Essentially, where 
an RMEM operation is dependent on more than one other RMEM operation, despite those other 
RMEM operations being independent of one another, the cascade bit should be set on all but the 
one with the lowest instance index. So in the figure, rmem[1] and rmem[2] would both have 
the cascade bit set, despite rmem[1] being independent of rmem[0].
These are false positives, but are necessary since the dependent memory read must occur after 
the completion of all the memory reads on which it depends (including any dynamic delay). 
There is no way to directly describe this situation in the configuration. The algorithm will have 
re-ordered these operations into a contiguous set of cell instances, with the cascade bit being 
set on each of them. This serialises each of the memory accesses (including the independent 
ones), which achieves the desired effect of taking into account all of the dynamic delays, but 
also decreases the throughput, since it removes the parallelism. This is a hardware limitation 
that could be avoided at the expense of additional configuration bits, but since this situation is 
rare, this was deemed unnecessary.
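The cascade bit rule, including the false positives, can be sketched as below. The input format is an assumption for the example: `deps` maps each rmem instance index (after re-ordering) to the set of lower indexes it directly depends on within the same step. The sketch relies on the re-ordering having already made each group of related operations contiguous, as the text requires; it does not check contiguity itself.

```python
def cascade_bits(deps):
    """Return a list of cascade bits, one per rmem instance index.

    deps -- dict: instance index -> set of instance indexes it depends on
            (indexes are assumed to be 0..n-1 after re-ordering)
    """
    bits = [False] * len(deps)
    for index in sorted(deps):
        preds = deps[index]
        if any(p >= index for p in preds):
            raise ValueError("dependencies must have lower instance indexes")
        if preds:
            # A dependent read is cascaded to the preceding instance.
            bits[index] = True
        if len(preds) > 1:
            # False positives: serialise all but the lowest-indexed
            # predecessor so every dynamic delay is accounted for.
            for p in sorted(preds)[1:]:
                bits[p] = True
    return bits
```

For the situation described for figure 4.37(c), `cascade_bits({0: set(), 1: set(), 2: {0, 1}})` yields `[False, True, True]`: rmem[1] is cascaded even though it is independent of rmem[0].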
4.12 Global Register Reallocation Information
The scheduler has no concept of how the interconnect will be configured for each step. This 
is the task of the routing tool (mapper). Therefore, the choices as to which cell instance is 
assigned to which task in each step may not be the most efficient in terms of physically mapping 
to the array, i.e. may have longer paths than necessary. Only the routing tool has the 
information needed to optimise the allocation of which active operation of each type in each step 
maps to which instance of the physical cell of that type. The routing tool therefore requires the 
freedom to change this allocation. This section first proposes several methods for how the 
mapper tool could improve routability by reallocating registers. Then an algorithm is proposed for 
generating the information as to which registers are connected across step boundaries. Finally, 
a method for how the mapper would use this information is proposed.
Reallocating stateless instruction cells is no problem, and can be done arbitrarily without 
affecting the behaviour of the program. However, cells that have state must be reallocated globally, 
so that the information stored there is consistent between uses. Registers are the most 
common example of cells that maintain state. Due to their prevalence (several in almost every step 
of the program), the allocation of registers has a significant effect on the performance of the 
program,36 and on the routability.37
Figures 4.38 and 4.39 demonstrate this by example. The white boxes represent the steps of the 
program, with the thick grey arrows showing control flow between them. The program consists 
of two loops, one with a single step (i.e. a kernel), and one with several steps, plus the usual 
program entry and exit steps, with some computation being performed prior to entry to the first 
loop. This is best seen in figure 4.38(a). The majority of computation is done in the two loops.
There are 4 registers in this example, which pass information between the steps. The registers 
are shown in figure 4.38(b) on entry to and on exit from each step, with dark vertical lines (and 
loops) showing the lifetime of the information stored there. Each information line has a number 
written beside it,38 which indicates the globally unique index of that piece of information. When 
a line originates from the output of a step (the bottom side of the box), that indicates that the 
information is created inside that step. When a line ends at the input of a step (the top side of the 
box), that indicates that the information is last read from in that step, and the register will either 
be dead on exit (no line on exit), or have a new piece of information written to it (another line 
begins on exit). When a line passes through a step, that indicates that the same information is 
still stored in that register, i.e. it is not overwritten in the step. A small coloured box is shown 
beside each step, indicating the amount of congestion present in that step. This congestion is 
assumed to be due to the combination of how busy a step is and excess lengths in the paths to 
and from the registers.
36by their effect on the critical path.
37due to congestion resulting from several long paths becoming tangled up.
38at the moment when the information is created.
39as will be the case in typical programs.

















Figure 4.38: Information flow diagram for a simple program, (a) Program steps and control flow.
(b) Information flow between registers, with original register assignment made by the
scheduler.
Figure 4.39(a) shows the program with the original register allocation assigned by the 
scheduler40, and is identical to figure 4.38(b) but with some added highlights. The information 
assigned to register r3 by the scheduler is shown in blue, and the information assigned to 
register r4 by the scheduler is shown in purple. This information is moved around as a result of 
reallocation. For clarity, figure 4.39 and associated commentary will only consider the effects 
of reassignment between registers r3 and r4. However, it should be noted that reassignment of 
all four registers would be possible, and could lead to significant improvements in routability.

Figure 4.39: Information flow diagram for the simple program in figure 4.38. (a) Original 
register assignment made by the scheduler, (b) Register assignment following the global 
swapping of registers r3 and r4. (c) Register assignment following a reassignment of 
the register associated to each piece of information originally mapped to r3 and r4. 
N.B. Each piece of information still has one particular register holding it throughout 
its entire life time, (d) Register assignment with the addition of register-to-register 
copies (limited to r3 and r4), allowing pieces of information to be moved to different 
registers.
4.12.0.1 Global Register Swap
In the original register assignment made by the scheduler (figure 4.39(a)), we can see that the 
kernel L1 is heavily congested. Other steps in the program are less congested, so one approach 
would be to globally swap one register for another such that the congestion is minimised in L1. 
To this end, figure 4.39(b) shows the result of swapping registers r3 and r4 globally, to reduce 
the congestion in L1. The information originally assigned to r3 is highlighted in blue, and the 
information originally assigned to r4 is highlighted in purple.
Globally swapping one register with another41 can improve the situation for particular steps, 
but due to the substitution being global, the same allocation may be problematic in other steps. 
In figure 4.39(b), the global reallocation has had a positive effect on the routability of L1, and 
incidentally also on _main, but has had a negative effect on the routability of _main_step2, 
L2 and L2_step2. L2_step2 is a particularly busy step, so the overall effect of the global 
reassignment in this example was to make the situation worse.
4.12.0.2 Global Information Swap
Even if an algorithm were devised to somehow provide a compromise across all the steps in 
the program, the allocation resulting from a global re-assignment could still be sub-optimal 
in many steps, potentially leading to congestion or excessive critical path, and the problem 
becomes increasingly worse with program size42.
A purely global reallocation is unnecessary. The role of registers is to pass information between 
steps. So long as the same register is used on both sides of the step boundary over which 
the information is passed, the program behaviour remains the same. Therefore, it should be 
possible to swap two registers in only the steps where those registers represent the same piece 
of information. This information swapping approach is demonstrated in figure 4.39(c). It can 
be visualised as follows: as introduced earlier, each piece of information is represented by a 
continuous vertical bar (with possible circular feedback) passing through the same register on 
each step that it passes in to and out of. Each of these bars can be dragged horizontally from 
one register to another, without affecting the program behaviour.
In this example, only the information previously assigned to r3 (i6) and r4 (i5) in the most 
heavily congested step, L1, have been swapped. This affects the steps where this information 
also exists, i.e. _main_step2 (on exit only), _main_step3, L2, or where this forces another 
piece of information into another register, i.e. L2_step2 (on entry only). The effect is an 
improvement in the routability of L1 (the primary objective), and a slight worsening in the 
routability of L2 and L2_step2. In this example, L2_step2 has not been affected so badly 
this time, due to the extra congestion imposed by a global swap being the result of the data 
paths writing to r3 and r4 on exit from L2_step2 (i.e. information i8 and i9), which become 
longer if those registers are swapped. Similarly, _main_step2 has not been affected so badly 
this time, due to the extra congestion imposed by a global swap mostly being the result of the 
data path reading from r3 (i.e. information i3), which becomes longer if r3 and r4 are swapped. 
The net effect is an overall improvement in routability, with no heavily congested steps.
41as is done for other cells that maintain state.
42in terms of number of steps.
4.12.0.3 Information Cross-over
A further improvement is possible: the information could be moved from one register to another 
during its lifetime, allowing the information to be transferred to a register that is in a more 
optimal location in a busy step, with the move happening in a less busy step43.
Figure 4.39(d) shows the effect of this final optimisation. Here the blue and purple information 
bars can be seen to have been bent in certain steps, where the information has been transferred 
to a different register inside the step. This occurs in _main_step3 and L2. The transfer 
of i5 and i6 in _main_step3 allows _main_step2 to be restored to the original register 
assignment, which results in less congestion in that step, but at the expense of a small reduction 
in routability of _main_step3. Similarly, the transfer of i10 in L2 reduces the congestion 
in L2_step2, at negligible cost in L2. The net effect is a further improvement in overall 
routability.
The next sections describe a method to obtain the information needed to perform both of these 
partial register reallocation schemes: global information swap and information cross-over. This 
is referred to as Global Register Reallocation Information.
4.12.1 Contribution: Obtaining The Global Register Reallocation Information
Prerequisites: Live register information at the basic block level, and temporary register 
assignment information for each step that was constructed from the basic blocks.
Results: The list of individual pieces of information entering and exiting each step in the 
program via each register.
The control flow graph and live register information can be used to determine which input and 
live output registers represent the same piece of information. This information can then be used 
by the routing tool to allow registers to be reallocated so that they appear closer on the array to 
the cells to which they are connected. This significantly reduces pressure on the interconnect,44 
avoiding congestion, and decreasing the propagation delays.
The information obtained during scheduling/parallelisation (section 4.8) defines the lifetimes 
of each piece of information (more specifically, each DFG edge; see section 4.5) between the 
steps derived from each basic block. This then augments the live register information obtained 
using the live register identification algorithm (presented in section 4.7), which defines which 
registers actively store information between basic blocks. By combining the two, it is possible 
to determine the lifetime of each piece of information across all possible execution paths in the 
entire program.
In other words, after scheduling, it is possible to equate each active register on input and output 
of each step, to a globally unique piece of information. Each piece of information is assigned a 
globally unique ID.
43where the overhead of the connection between the two registers will not cause problems.
44since fewer interconnect edges are needed.
For each junction between basic blocks in the program control flow graph (CFG), the live 
register information is used to determine which output registers and which input registers represent 
the same piece of information being passed between those basic blocks, as shown in figure 4.40. 
A similar thing is done for the temporary registers between each step in a basic block. The latter 
is much easier, since the steps in a basic block are always executed in sequence (i.e. are always 
an example of figure 4.40(a)).
Figure 4.40: How control flow defines on which step boundaries a given register represents the 
same piece of information, (a) Linear control flow (e.g. between steps derived from 
the same basic block), (b) Control flow branch (e.g. conditional jump), (c) Control 
flow join (e.g. return from function), (d) Loop.
It should be noted that it is safe to underestimate the number of unique pieces of information 
present in the program, but it is dangerous to overestimate it. Underestimating results in 
registers being bound together in more steps than is really needed, whereas overestimating results in 
values being corrupted45.
For this reason, the presence of loops has to be carefully taken into account. In purely sequential 
control flow, when a given register is overwritten, this results in it storing a different piece of 
information on entry and exit. However, if that same step sequence appears within a loop, then 
the control flow path back to the beginning can mean that these are in fact the same piece of 
information. This can be seen in figure 4.40(d), where information on entry to L2 and on exit 
from L2_step2 have to be the same piece of information. However, that information is being 
updated within the loop. A common example of this is the loop counter. These are defined to 
be the same piece of information, since the same register must be used at the beginning and end 
of the loop in order for the new value to be seen correctly in the next iteration.
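One way to assign globally unique IDs that naturally handles loops is to merge register slots with a union-find structure. This is an illustrative choice, not necessarily the method used in the thesis. The inputs are assumptions for the sketch: `boundaries` lists control flow edges as `(pred_step, succ_step, live_registers)`, and `pass_through` lists `(step, register)` pairs where the register is not overwritten inside the step. Merging can only reduce the number of distinct IDs, which is the safe direction described above.

```python
def information_ids(boundaries, pass_through):
    """Map each (step, side, register) slot to an information ID.

    side is "in" (on entry to the step) or "out" (on exit from the step).
    """
    parent = {}

    def find(slot):
        parent.setdefault(slot, slot)
        while parent[slot] != slot:
            parent[slot] = parent[parent[slot]]  # path halving
            slot = parent[slot]
        return slot

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The same register on both sides of a boundary holds the same information.
    for pred, succ, registers in boundaries:
        for reg in registers:
            union((pred, "out", reg), (succ, "in", reg))
    # A register that is not overwritten carries its information through the step.
    for step, reg in pass_through:
        union((step, "in", reg), (step, "out", reg))

    ids, result = {}, {}
    for slot in list(parent):
        root = find(slot)
        result[slot] = ids.setdefault(root, len(ids))
    return result
```

With a back edge from L2_step2 to L2, the slot on exit from L2_step2 and the slot on entry to L2 fall into the same set, so a loop counter held in one register receives a single ID even though it is updated inside the loop.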
45since the register could be changed across a boundary where the same register should really be used.
4.12.2 Contribution: Using The Global Register Reallocation Information
The routing tool can use the global register reallocation information to re-assign register 
numbers, in order to minimise interconnect path lengths. Reallocation consists of looking at each 
boundary between steps where one step passes control to another. At that boundary, a single 
register must be assigned to each piece of information that needs to pass across that boundary. 
The register may be chosen from the set of registers available across that boundary. Since each 
step may pass control to more than one other step,46 the assignment must be consistent for all 
boundaries involving the same step.
4.12.2.1 Method 1: Global reassignment
The order in which the boundaries are examined is significant. The search space is potentially 
quite large, and the process of calculating the cost is expensive47. Therefore, a sensible heuristic 
is needed.
The proposed heuristic approach consists of first applying the global information swap 
technique, followed by information cross-over. The steps are visited in descending order of 
complexity, such that the steps that are likely to be the most congested are dealt with first, thus 
having the most freedom to reallocate. Register reallocation is performed on the boundaries 
involving each step, in this order. When a piece of information has been reallocated to a different 
register, that allocation (information to register) is frozen in all steps in which that information 
exists. Each step is only allowed to reallocate registers for pieces of information that haven’t 
yet been frozen. This corresponds to the global information swap technique.
If a piece of information is frozen, but reallocating it to a different register in the current step 
would be advantageous, it may be possible to move it to a different register, if the control flow 
allows. This corresponds to the information cross-over technique. Information cross-over is 
normally applied only on step boundaries within the same basic block, as these are usually the 
only boundaries that have linear control flow. Other forms of control flow (loops) require that 
the move is performed in other steps too, which makes the problem more complex to solve.
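The visiting order and freezing behaviour of the information swap phase can be sketched as follows. The sketch makes several assumptions that are not from the thesis: `complexity` is a precomputed score per step, `info_in_step` lists the pieces of information present at the boundaries of each step, and a simple "first free register" choice stands in for the mapper's real cost evaluation (allocation and routing). Information cross-over is not modelled.

```python
def reallocate(complexity, info_in_step, registers):
    """Freeze one register per piece of information, busiest steps first.

    complexity   -- dict: step name -> complexity score (hypothetical metric)
    info_in_step -- dict: step name -> list of information IDs in that step
    registers    -- candidate registers in order of preference
    Returns a dict mapping information ID to its frozen register.
    """
    frozen = {}
    order = sorted(info_in_step, key=lambda s: complexity[s], reverse=True)
    for step in order:
        for info in info_in_step[step]:
            if info in frozen:
                continue  # already fixed by a more complex step
            # Registers held by frozen information that coexists with
            # `info` in any step must not be reused for it.
            taken = {
                frozen[other]
                for members in info_in_step.values()
                if info in members
                for other in members
                if other in frozen
            }
            # Placeholder choice; raises StopIteration if no register is free.
            frozen[info] = next(r for r in registers if r not in taken)
    return frozen
```

Because the most complex steps are visited first, they choose while nothing is frozen, and simpler steps inherit whatever remains, which is the trade-off the heuristic is intended to make.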
4.12.2.2 Method 2: Register renaming
A simpler technique to implement takes advantage of the fact that most of the execution time
of a well-mapped program is spent in kernels. Therefore, optimising these alone should be
sufficient. Also, kernels have the highest core utilisation, so will benefit most from a less
constrained allocation. The neighbouring steps to kernels tend to be very simple, performing tasks
such as initialising the loop iterator and addresses. So, adding complexity to these neighbouring
steps will have little impact. The control flow is also very simple: the step before the kernel
always passes control directly to the kernel, and the kernel always passes control either to itself
or, once finished, directly to the step after the kernel.
46 depending on the state of the machine at the time of execution.
47 due to allocation and routing needing to be performed on each attempt.
Scheduling
The technique involves de-coupling information in the kernel from the steps before and after the
kernel. Any information that enters the step before the kernel must be transferred to a different
register when exiting that step (where it enters the kernel). Similarly, any information that exits
the kernel must be transferred to a different register when exiting the step after the kernel.
After this de-coupling, the registers in the kernels can then be reallocated freely. These registers
are then locked in place, but this will only affect the steps before and after the corresponding
kernel. The cost is therefore increased register-to-register activity in the steps surrounding each
kernel (roughly half of which cannot be reallocated), but the kernels get a completely free




This section shows the results from experiments that were devised to demonstrate the following:
Section 4.13.1: The ability of the tree follower scheduling algorithm (section 4.9.2) to
generate configuration contexts out of basic blocks, and their quality, in terms of resource
utilisation/parallelism and execution time.
Section 4.13.2: The number of additional registers that live register identification (section 4.7.1)
makes available for scheduling and other purposes, and the effect this has on the quality
of the resulting schedule.
Section 4.13.3: The effectiveness of the register starvation avoidance schemes (section 4.10)
in allowing complex basic blocks to be scheduled for highly resource starved cores, and
the quality of those schedules.
Section 4.13.4: The effect of register renaming (section 4.12.2) on connection lengths, when
applied to configuration contexts with high utilisation.
4.13.1 Results: Scheduling Algorithm
This section looks at the performance of sequences of configuration contexts generated from
a given basic block by the scheduling algorithm presented in section 4.9.2 on page 92, when
subjected to certain resource constraints.
The experiment is designed to take a kernel occupying a significant fraction of a small RICA
core, and artificially restrict the availability of certain key resources, to force the scheduling
algorithm to split the kernel up into multiple steps. The quality of the resulting schedule is then
analysed, in terms of effect on step count, total critical path, and throughput.
The example used is a 2-D discrete cosine transform filter (8x8 DCT-II) [80] common in
JPEG/MPEG image compression. The implementation of this filter (in C) performs the 1-D
DCT as a single loop (kernel), which is called twice: once to operate on the columns, and
once on the rows. The compiler generates a single basic block for this inner loop, the resource
requirements for which are shown in table 4.6. The scheduler processes all of the basic blocks
generated by the compiler; however, this analysis will only look at this kernel, as it should
represent the majority of the execution time.
As the base line, the scheduler is given a target processor with sufficient resources for the
entire kernel to fit into a single step. The experiment then involves selecting some key resource
(instruction cell) types, and gradually reducing their availability (instance count). Properties












write memory 8
register 35
Table 4.6: DCT kernel resource requirements, in terms of instruction cells on the target architec­
ture.
The following resources were constrained in turn: multipliers (MUL), memory read (RMEM),
memory write (WMEM), addition/comparison (ADDCOMP). For each series, only the
corresponding resource is constrained; all others are available in sufficient quantity. The instance
count is then ramped down, to show the resulting gradual degradation in terms of execution
speed (throughput, figure 4.43) and program memory cost (step count, figure 4.41). The
theoretical minimum step count can be calculated as follows:
$$\mathit{steps}_{min} = \operatorname{ceil}\left(\frac{n_{required}}{n_{available}}\right)$$
where n represents instance count of the constrained cell type. In all cases tested, the scheduling
algorithm meets this theoretical minimum.
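As a small illustration of this lower bound, the sketch below computes the minimum step count for a single constrained cell type. The function name `min_steps` is introduced here for illustration only; the example figures are the multiplier counts quoted for the DCT kernel (16 required, with 8, 4 or 2 available) and the 61-versus-5 case used later in section 4.13.3.

```python
def min_steps(n_required, n_available):
    """Lower bound on step count: ceil(n_required / n_available)."""
    if n_available <= 0:
        raise ValueError("at least one instance of the cell type is needed")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-n_required // n_available)


# Multiplier-constrained DCT kernel: 16 MUL operations in total.
dct_steps = [min_steps(16, available) for available in (16, 8, 4, 2)]

# Gamma correction kernel: 61 operations of one type, 5 instances available.
gamma_steps = min_steps(61, 5)
```

This bound only counts instances of one cell type; it says nothing about the critical path of the resulting steps.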
Figure 4.41: Step count resulting from multiple runs of the scheduling algorithm on the DCT ker­
nel, against availability of certain key resources. Starving the kernel of computation 
resources makes it map to multiple steps, to time-domain multiplex the available re­
sources. In each case, the scheduling algorithm produces the minimum number of 
steps possible with a given constraint (e.g. 10% availability => 10x fewer cells than 
needed => a minimum of 10 steps required).
118
Scheduling
Figure 4.42: Total critical path resulting from multiple runs of the scheduling algorithm on the 
DCT kernel, against availability of certain key resources. The total critical path is 
the sum of the critical paths of each of the steps produced. The increase indicates 
how much parallelism has been lost as a result of partially serialising the data paths 
into steps. The scheduling algorithm manages to keep this well below what would be 
expected from the increase in the number of steps, e.g. a 10x reduction in ADDCOMP
cell availability results in a 10x increase in steps (see figure 4.41), but only a 4.5x
increase in critical path.
Figure 4.43: Throughput resulting from multiple runs of the scheduling algorithm on the DCT 
kernel, against availability of certain key resources. Throughput is based on the total 
measured execution time of the kernel, and includes the effect of step loading times. 
The decrease in throughput more closely matches the increase in the number of steps
than the increase in critical path. For example, a 10x reduction in ADDCOMP results
in a 10x increase in kernel steps (figure 4.41), but a 4.5x increase in total critical
path (figure 4.42), leading to a 10x reduction in throughput. This is because the step
load-time dominates in this example.
Additionally, figure 4.42 shows the sum of the critical paths of the resulting steps, which gives
a clearer view of what the scheduling algorithm has done. This is related to execution speed,
but execution speed is also further affected by the step loading times imposed by the additional
steps. The throughput (figure 4.43) and total critical path (figure 4.42) graphs are normalised to
the base line (with no resources constrained).
The degree of overlap achieved is shown in figure 4.44. This shows the relative increase in
critical path vs. the relative reduction in resource availability. An overlap of 0% indicates a
Figure 4.44: Achieved extent of overlap of the critical path, against availability of certain key re­
sources. Overlap is the increase in total critical path v.s. the decrease in resource 
availability, both relative to the base line (unconstrained case). An overlap greater 
than zero indicates a net saving. The relative overlap is less than zero when the re­
sources are only slightly limited, indicating that some of the data paths that ideally 
would run in parallel are being separated out into a second step. This is unavoidable. 
The relative overlap improves as more of the data paths have to be brought into the 
second step, creating a more balanced schedule. When more than two steps are nec­
essary, there is a lot more room for the scheduling algorithm to improve parallelism, 
shown here by a positive relative overlap.
linear relationship. A value greater than zero indicates that the resource constraint is being
absorbed, partially hiding its cost. It is calculated as follows:
$$\mathit{Overlap} = \frac{cp_{baseline}}{cp_{total}} - \frac{n_{available}}{n_{required}}$$
where cp represents critical path, and n represents instance count of the constrained cell type.
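The overlap measure can be illustrated numerically. The sketch below assumes that overlap is the difference between the ratio of baseline to total critical path and the ratio of available to required instances, which is the reading consistent with 0% meaning a linear relationship; the function name is introduced here for illustration. The example uses the ADDCOMP case quoted in the figure captions, where a 10x reduction in availability gives a 4.5x increase in total critical path.

```python
def overlap(cp_baseline, cp_total, n_available, n_required):
    """Overlap = cp_baseline / cp_total - n_available / n_required.

    Zero means the critical path grew in proportion to the resource
    reduction; positive values mean part of the constraint was absorbed.
    """
    return cp_baseline / cp_total - n_available / n_required


# Linear case: 10x fewer cells and a 10x longer total critical path.
linear = overlap(1.0, 10.0, 1, 10)

# ADDCOMP case from the captions: 10x fewer cells, 4.5x longer critical path.
addcomp = overlap(1.0, 4.5, 1, 10)
```

Here `addcomp` is positive, reflecting that the increase in critical path is well below the increase in step count.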
Looking at the data flow graph for the basic block (figure 4.45), all the operations are part of
the same data path. Considering the case of the multiplier (MUL), there are 12 instances of
this operation with a start delay of 3.46ns (3rd row in the DFG), 2 with a start delay of 5.45ns,
and 2 with a start delay of 11.94ns. When fewer than 12 instances of the multiplier resource are
available, the 12 operations in the 3rd row cannot all be performed together. These lie on the
critical path, and so splitting the basic block into steps will have to increase the total critical
path. If 8 multipliers are available (i.e. half the total required number), the lowest achievable
total critical path will occur if 8 of these 12 MUL operations are put in the first step, and the
remaining ones put in a second step along with the later MUL operations. The MUL operations
should lie at the end of the critical path of the first step (feeding their output into a temporary
register), in order to minimise the critical path of the first step. The second step will have the
same critical path as the original DFG.
Figure 4.45: Data flow graph of the main kernel (L4) in the DCT example, highlighted according 
to which step each operation gets placed in. Three situations are shown, with the 
multiplier (MUL) resource constrained to: (a) 8 instances (2 steps), (b) 4 instances (4
steps), (c) 2 instances (8 steps). Instances of the constrained resource type (MUL) are
shown with a black outline.
Putting this into numbers: the 12 MUL operations earliest in the basic block DFG appear
at 3.46ns, and produce their output at 4.35ns. Adding a register to store the result gives an
additional 1.44ns for interconnect delay and 0.1ns internal delay for the register. Therefore, the
minimum critical path of the first step is 4.35 + 1.44 + 0.1 = 5.89ns. The entire basic block
DFG has a nominal critical path of 16.82ns, as will the second step. Therefore, the minimum
possible increase in critical path is 5.89/16.82 = 35%.
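The arithmetic above can be reproduced directly. The variable names below are introduced here for illustration; the delay values are those quoted in the text.

```python
# Delays quoted in the text, in nanoseconds.
mul_output_ns = 4.35      # time at which the first-row MULs produce output
interconnect_ns = 1.44    # interconnect delay to the temporary register
register_ns = 0.10        # internal delay of the register
full_dfg_cp_ns = 16.82    # nominal critical path of the whole DFG

# Minimum critical path of the first step.
first_step_cp_ns = mul_output_ns + interconnect_ns + register_ns

# The second step keeps the critical path of the original DFG, so the
# minimum total is the sum, and the relative increase is the first step's
# share of the original critical path.
min_total_cp_ns = first_step_cp_ns + full_dfg_cp_ns
min_increase = first_step_cp_ns / full_dfg_cp_ns
```

Rounded to two significant figures, `min_increase` gives the 35% lower bound against which the achieved schedule is compared.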
The actual increase achieved by the scheduler was 37% in this case, which is slightly worse than
the theoretical minimum calculated above. Looking at the resulting schedule (figure 4.45(a)),
we can see why: the algorithm has chosen to include one of the 2nd level multipliers in the first
step, artificially increasing the critical path. This is because the algorithm used here searches the
predecessors of the active edge in arbitrary order. It should perform better if the predecessors
were visited in order of their output delay. The tool doesn’t implement this approach because
this imposes a significant increase in the execution time of the scheduling algorithm. In its
commercial use, the tool execution overhead was deemed more significant than the effect on
the quality of schedule achieved.
4.13.2 Results: Live Register Identification
This section demonstrates the effectiveness of the live register identification algorithm
(described in section 4.7.1 on page 77) in increasing the number of registers available for use by
the scheduler. The scheduler infers additional registers (termed temporary registers) when a
basic block has to be split into multiple configuration contexts (section 4.8.1 on page 84). It also
infers additional registers for use in certain assembly-level optimisations, such as converting
stack local variables into registers, or counter replication.
This is demonstrated using two examples from elsewhere in the thesis: a 2-D 8x8 DCT
transform, introduced in section 4.13.1, and a gamma correction module test bench, introduced in
section 5.8.2 on page 174. The DCT example was chosen as it has a rather simple kernel,
showing the lower end of what live register identification will work on. The gamma correction
module was chosen as a more complex example, large enough for the compiler to have re-used
a lot of registers, making fewer registers available for use as temporaries.
Figures 4.46 and 4.47 show the results for the DCT example, and figures 4.48 and 4.49 show
the results for the gamma correction module example. The graphs show the basic blocks in the
corresponding program along the x-axis, referred to by index in the program. The bars show
the number of instructions in each basic block, to indicate their complexity. The lines show the
percentage of all registers in the core that are available for use by the scheduler. The remaining
registers were either used by the compiler, or are not known to be safe to use.
The DCT example was compiled and scheduled for a core with 64 registers, 16 of which are
reserved for scratch. This gives just enough registers for the compiler to produce uncompromised
data paths and blocks. Similarly, the gamma correction example was compiled and scheduled
for a core with 250 registers and 35 scratch registers, for the same effect. This minimises the
advantage of live register identification, leading to a fairer comparison. The more registers
there are in the core (not reserved as scratch), the higher the potential advantage. This is
because the number of registers used by the compiler will remain constant, so the ratio of unused
to used registers will increase, and how many of them are unused is not known without live
register identification.
Without live register identification, when basic blocks are split into steps, temporary register
assignment (section 4.8.1 on page 84) can only assign temporary registers from the pool of
registers that are active (i.e. written to or read from) in the basic block, plus scratch registers
(which are reserved for use by the scheduler). All other registers might contain data that must
be transported through the basic block for use later in the program. The effect of live register
identification on this is to determine which of these other (inactive) registers might contain
important information, making all the others available for use in storing temporaries.
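The two pools of candidate temporary registers can be expressed as simple set operations. The sketch below is illustrative only: the function names and the eight-register example core are made up here, and the set of dormant registers is assumed to be supplied by live register identification rather than computed.

```python
def temporaries_without_liveness(active, scratch):
    """Only registers active in the block, plus scratch, are known to be safe."""
    return set(active) | set(scratch)


def temporaries_with_liveness(all_registers, active, scratch, dormant):
    """Inactive registers that are not dormant become available as well."""
    inactive = set(all_registers) - set(active) - set(scratch)
    return set(active) | set(scratch) | (inactive - set(dormant))


# Hypothetical 8-register core: r0 and r1 are active in the block, r7 is
# scratch, and r2 is dormant (it carries a value through the block).
registers = [f"r{i}" for i in range(8)]
without = temporaries_without_liveness({"r0", "r1"}, {"r7"})
with_lri = temporaries_with_liveness(registers, {"r0", "r1"}, {"r7"}, {"r2"})
```

In this made-up case the pool grows from three registers to seven, and the gain is largest for blocks in which the compiler uses few registers.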
Figure 4.46: Registers available for storing temporary values inside each basic block in the DCT 
program, with and without live register identification. The bar graph shows the com­
plexity of each basic block (i.e. number of instructions). Without live register infor­
mation, only the scratch registers and those registers used by the compiler in that 
basic block are available for use as temporaries. The number of registers used by the 
compiler is roughly proportional to the complexity of the basic blocks in this exam­
ple, as shown by the ‘without live register info’ graph. Any registers not used by the 
compiler might store important data across that block (i.e. are dormant), and thus 
can’t safely be used. Live register identification determines which of these really are 
dormant, and makes the rest available for use as temporaries. This mostly benefits 
the least complex blocks, as the compiler uses fewer registers there.
Figure 4.47: Registers available over the boundaries between basic blocks in the DCT program, 
with and without live register identification. The bar graph shows the complexity of 
each basic block (i.e. number of instructions). Only the scratch registers are safe 
to use between basic blocks, as any register written to by the compiler could pass 
data between basic blocks. Live register identification determines which registers 
written to by the compiler actually store values across that block boundary, which is 
generally only a small subset of those written to, as can be seen here. Those which are 
not live on exit can be mapped entirely to wires by the scheduler (i.e. do not require 
temporaries), which particularly benefits the most complex blocks.
Figure 4.48: Registers available for storing temporary values inside each basic block in the gamma 
correction program, with and without live register identification. The bar graph 
shows the complexity of each basic block (i.e. number of instructions). The situa­
tion is similar to that in figure 4.46, except that the main kernel (block no. 6) in this 
example has a high degree of register re-use. Live register identification in this case 
makes a significant improvement in register availability for temporaries.
Figure 4.49: Registers available over the boundaries between basic blocks in the gamma correc­
tion program, with and without live register identification. The bar graph shows the 
complexity of each basic block (i.e. number of instructions). The situation is similar 
to that in figure 4.47: most registers unused by the compiler do not store information
between the basic blocks of the program.
Figure 4.46 shows a fair improvement in register availability for use as temporaries in most
blocks. The improvement is least significant in the basic blocks with the highest utilisation,
e.g. the two kernels: L4 (index 3) and L6 (index 5). It is in these blocks where temporary
registers are going to be of the most use. However, even a small increase in register availability
can make it easier for the scheduler to form a more efficient sequence of steps. Also note that
the advantage increases linearly as the total register count is increased. The gamma correction
example (figure 4.48) paints a different picture: the kernel in this example (index 6) shows a
significant increase in register availability. The results in section 4.13.3 show the effect that
register availability for temporary registers has on the ability to schedule, and on the quality of
the results.
For register availability between basic blocks, without live register identification, the DFG
analysis (section 4.5 on page 67) must assume that all registers that the compiler could use store
important data. As a result, only scratch registers (reserved for use by the scheduler) can safely
be used between basic blocks. Live register identification improves this situation in several
ways: it determines which of the registers that were inactive in the basic block store important
data through that block (i.e. dormant registers), it determines which of the block’s input
registers need to have their value preserved for use later in the program, and it determines which
of the registers that were written to in the basic block actually pass information to subsequent
blocks. As a result, the improvement is more dramatic in these cases, as can be seen in
figure 4.47 and figure 4.49.
Experience shows that the ability to identify which of the inactive registers are dormant makes
it possible to eliminate the need for scratch registers. This means that for a given core, the
compiler has more registers available to it, which helps it form larger basic blocks which are
better candidates for parallelisation.
Another benefit of live register identification (not looked at here) is the effect on reduced
activity on the core: by identifying which registers written to in the basic block actually carry
data out of the basic block, the number of connections to registers is reduced, which improves
routability.
4.13.3 Results: Register Starvation Avoidance
A gamma correction filter [81, 82], a common module in a typical image signal processing
(ISP) pipe [83], was used to demonstrate the onset and effectiveness of each register starvation
avoidance technique described in section 4.10 on page 95. The module performs all of the
pixel-level work in a single kernel, which loops over all pixels in the image. This module was
chosen since it is large enough to nearly fill the example core, and has several independent
data paths (i.e. one for each of the 6 channels processed). This means that there are fewer
data dependencies, leading to larger scope for rearrangement. The rearrangement of data paths
performed by shuffle is done in a random manner, so this increased freedom should exaggerate
any critical path overhead. The register instance count was artificially reduced to an extent
that puts significant pressure on the scheduling algorithm, both with and without live register
identification, but high enough so that the compiler doesn’t map all the local variables to the
stack (which would take them out of the scheduler’s control).
The compiler’s choice of register assignment ensures that register starvation does not occur
between basic block boundaries, or internally to the basic block if the instruction stream is
followed exactly. Scheduling of the basic block DFG onto the target core turns many of the
registers used internally in the basic block into wires. If the core has sufficient resources to
perform all the operations of the basic block in one configuration context, then no additional
registers are needed after parallelisation by the scheduler. However, if availability of one or
more resources causes the basic block to be split into multiple contexts, additional registers
(temporary registers) are needed to bring values across the boundaries between steps resulting
from that basic block. Therefore, in order to induce the scheduler to use more registers than
stipulated in the assembly, the target core must have insufficient resources to map the basic
block into a single step.
To this end, the target core was given resources sufficient for the main filter loop except for one
resource, ADDCOMP, which was given only 5 instances out of the 61 required. This means
that the scheduler must split the basic block into at least ceil(61/5) = 13 steps. Since each
of these steps must be executed in sequence for every pixel of the image, this severely limits
the throughput. This leads to a throughput of a mere 7.2Mpixels/s in the case where there are
ample registers, and lower once the registers become constrained.
The experiments involve using the same assembly for each test, with a different register count
being given to the scheduler each time. Using the same assembly means that the data flow
graph of the kernel remains the same. Therefore, all changes in the resulting netlist will be
as a result of the scheduler’s actions. When building this assembly, the register count must be
chosen carefully. This has to represent the lowest register count to be tried by the scheduler,
otherwise the assembly will contain active registers that don’t exist in the target architecture.
In order for register starvation to occur at all, the number of internal values needing to be
preserved across the internal step boundaries must exceed the number of registers available for
this use. However, if the register count is too low, then the compiler will push excessive values
on the stack, making the kernel memory bound, causing the memory accesses to dominate
the execution time. This would obscure the effect of changes to the critical path. As a fair
compromise, a register count of 32 was given to the compiler. Table 4.7 shows the resources in











stream  buffer 12
register 11
read memory 1
Table 4.7: Gamma correction filter kernel resource requirements, in terms of instruction cells on 




The scheduler was then run several times on this assembly, using a range of different register
counts. The range was chosen such that the maximum register count was more than enough for
the basic block to schedule without any problem, and the minimum register count is equal to the
register count used to generate the assembly. The experiments were repeated with live register
identification disabled, so that only the 11 registers active in the main loop (plus an additional 4
scratch registers) are available for use as temporaries. A valid schedule was eventually achieved
in each case.
[Figure 4.50 chart: Register Starvation Avoidance - With Live Register Identification. Legend: Rewind, Shuffle, Split. Horizontal axis: register instance count.]
Figure 4.50: Total number of occurrences of each of the register starvation avoidance techniques 
when scheduling the gamma correction module’s main loop basic block, for a range 
of different register instance counts, with live register identification enabled. In this 
example, register starvation occurs only when the register count is below 44. As 
the register resource becomes more constrained, the number of rewind attempts in­
creases, and the more severe avoidance techniques become increasingly necessary.
Figure 4.50 shows how many times each of the register starvation avoidance techniques was
instigated when scheduling the main loop basic block. When register starvation is encountered,
rewind (section 4.10.1 on page 96) is performed as the first attempt at avoidance. Rewind can be
performed several times before a valid schedule is found. If all suitable rewind points have been
tried to no avail, shuffle (section 4.10.2 on page 98) is performed next, up to a total of 10 times. If
a valid schedule still cannot be achieved, basic block splitting (section 4.10.3 on page 100) is
performed as a last resort. These techniques are performed as three nested loops, so rewind is
tried again after a shuffle has been performed. Similarly, up to a further 10 shuffle attempts may
be performed after each split. The bars in the figure show the number of times each technique
was tried in total before a valid solution was finally landed upon. As one would expect, the




Figure 4.51: Total number of occurrences of each of the register starvation avoidance techniques 
when scheduling the gamma correction module’s main loop basic block, for a range of
different register instance counts, with live register identification disabled. Without 
live register identification, more registers are needed, and thus starvation occurs at 
a much higher instance count (compared to figure 4.50). All instance counts tried 
here are deep into starvation (comparable to the most extreme case in figure 4.50), 
requiring multiple attempts of each avoidance method. Despite this, a valid schedule 
is obtained from each case.
As a comparison, figure 4.51 shows the avoidance behaviour when live register identification is
disabled. The same number of registers are available for use in temporaries irrespective of the
total register count, so in each case the scheduling algorithm struggles in a similar manner to
the most register starved case when live register identification is enabled.
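The escalation order of the three avoidance techniques (rewind innermost, then shuffle, then split, as three nested loops) can be sketched as follows. This is a structural illustration only, not the tool's implementation: `try_schedule` is a hypothetical callback standing in for a scheduling attempt, the limit of 10 shuffles per split level comes from the text, and `max_splits` is an assumed bound, since no limit on splitting is stated.

```python
def schedule_with_avoidance(try_schedule, rewind_points,
                            max_shuffles=10, max_splits=4):
    """Escalate rewind -> shuffle -> split until a schedule is found.

    try_schedule(splits, shuffles, rewind) -> bool is a hypothetical stand-in
    for one scheduling attempt. Returns a summary dict, or None on failure.
    """
    attempts = 0
    for splits in range(max_splits + 1):            # outer loop: split
        for shuffles in range(max_shuffles + 1):    # middle loop: shuffle
            # Inner loop: the plain attempt, then each rewind point in turn.
            for rewind in [None, *rewind_points]:
                attempts += 1
                if try_schedule(splits, shuffles, rewind):
                    return {"splits": splits, "shuffles": shuffles,
                            "rewind": rewind, "attempts": attempts}
    return None


# Made-up scenario: only one combination of actions yields a valid schedule.
result = schedule_with_avoidance(
    lambda splits, shuffles, rewind: (splits, shuffles, rewind) == (1, 2, "b"),
    rewind_points=["a", "b"],
)
```

Because every shuffle restarts the rewind loop and every split restarts the shuffle loop, the total number of attempts grows quickly once starvation is severe, which matches the trend of the bars in figures 4.50 and 4.51.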
Figure 4.52 shows how many steps the basic block is scheduled into, in each case. As can be
seen, when live register identification is enabled, even in the worst cases only a single additional
context was generated, despite the basic block needing to be split several times before a valid
schedule could be obtained. Figure 4.53 shows the sum of the step critical paths for the resulting
schedule for each register instance count tried. Generally, the data path parallelism seems
to stay largely intact, with a maximum increase of 17%. Figure 4.54 shows a similar story,
this time measured in terms of throughput. This is based on measured execution time, which
includes the step load time overhead. The worst case yields an 11% reduction in throughput.
The resulting effect on critical path and throughput doesn’t directly correlate with the difficulty
in obtaining a valid schedule. This is indicative of the (increasingly) random nature of these
avoidance techniques.
When live register identification is disabled, the situation is even more severely constrained,
yet register starvation avoidance eventually yields a valid schedule with a 17% increase in total
critical path, and a 17% reduction in throughput.
Figure 4.52: Number of steps resulting from the scheduling of the gamma correction module’s 
main loop basic block, over a range of register instance counts. Without live reg­
ister identification, the difficulty in achieving a valid schedule is similar across all 
instance counts. This is reflected by the constant overhead in steps. With live register 
identification enabled however, most cases show no step overhead. Since the rewind 
attempts are random, and the first attempt to produce a valid schedule is chosen, it is 
often possible to achieve a better schedule in a more constrained case.
Figure 4.53: Change in total critical path of the resource constrained gamma correction module 
(compared to the unconstrained case), over a range of register instance counts. To­
tal critical path is the sum of the critical path of each step produced for the kernel, 
which generally increases as the situation becomes more register constrained. With­
out live register identification, the situation is deep into starvation for all register 
counts shown, and the quality of the resulting schedule is nearly identical in each 
case. With live register identification, a rise in total critical path can be seen as star­
vation gets worse. The step count increase (figure 4.52) lags the critical path increase, 
showing that the random rewind attempts often lead to the data paths in a step being 
made more combinatorial in preference to pushing them into a later step.
This example demonstrates that the register starvation avoidance techniques allow a valid
schedule to be obtained on a core with 12 (i.e. 44 - 32) fewer registers than can be achieved with just
the scheduling algorithm on its own, i.e. a 27% improvement in schedulability in this case. It
is believed that this may scale with the complexity of the basic block data flow graph. As of the




Figure 4.54: Change in throughput of the resource constrained gamma correction module, over a 
range of register instance counts. The throughput generally reduces as the situation 
becomes more register constrained, and is affected by a combination of the increase 
in step count and the increase in total critical path. In any case, the reduction in 
throughput is mild.
4.13.4 Results: Global Register Reallocation
A complete 3rd party image signal processing (ISP) pipe was used to investigate the effect of
register renaming (introduced in section 4.12.2 on page 115) on routability and path lengths.
The example ISP has a complexity of around 800 operations per pixel, and targets a RICA array
of 1000 cells. The modules of the ISP are grouped into 3 kernels, which are executed one after
another for each line of the image.
Mapping the connections of each step onto the real device requires the generation of paths along
the interconnect resources. To reduce the path lengths, cells are reallocated to bring connected
cells closer together. Cells which have internal state, such as registers, must be reallocated
consistently across all steps in the program. The mapping tool achieves this by choosing a
fixed allocation for these cells the first time they are used. This means that their locations are
optimum only in the first step where they are used. To minimise the consequences of this,
the steps are operated on in descending order of complexity48, so that the steps that are more
difficult to route are given the least restrictions. Typical steps contain many connections to and
from registers, so their positions (allocation) have a large effect on the path lengths in the step
overall.
                              Interconnect utilisation    Average ex. delay (ns)
Kernel           Connections  Disabled      Enabled       Disabled     Enabled
L1048 (first)    825          16.63%        16.36%        0.966        0.963
L917 (second)    710          21.92%        15.39%        1.55         1.10
Table 4.8: Post-routing statistics for the two most complex kernels in a 3rd party ISP pipe, with 
register renaming either disabled or enabled. L1048 will be freely reallocated in each 
case, but L917 will have some fixed allocations when register renaming is disabled.
⁴⁸ In terms of connection count.
Scheduling
Figure 4.55: Histogram of path lengths for each connection in the L1048 kernel, with and with­
out register renaming. Registers in this kernel are freely reallocated in both cases, so 
differences are the result of variation between runs. The bins are distributed logarith­
mically. Register renaming can be seen to make shorter paths more common, with no 
paths longer than 4.5ns (compared to a maximum at 7.5ns without renaming).
Figure 4.56: Histogram of path lengths for each connection in the L917 kernel, with and without 
register renaming. Some registers in this kernel have a fixed allocation when register 
renaming is disabled. The bins are distributed logarithmically. Register renaming 
can be seen to make shorter paths more common, with a single outlier at 14ns (com­
pared to a maximum at 25ns and several in the 20-22ns range without renaming).
Register renaming uses the global register reallocation information (obtained using the algorithm
described in section 4.12.1 on page 113) to make all registers freely reallocatable in each
kernel. Without this information, only the first kernel to be operated on is able to freely
reallocate registers. The example was chosen based on it being the available application with the
largest kernels (and thus the most variation in interconnect path length and placing the most
pressure on the interconnect), with more than one kernel so as to demonstrate that optimisation
is possible on each kernel individually (as opposed to performing a global reallocation tailored
for one kernel, but potentially compromising the others).
Results were obtained using a mapping tool that is still under active development. Results are
limited to the two most complex of the kernels. Excessive run times prevented enough runs
from being performed to completely characterise the variance between runs. Therefore, these
results should be considered to be merely indicative.
Table 4.8 shows pertinent connection statistics for this example, for two runs: with and without
register renaming. Figure 4.55 and figure 4.56 give a more detailed view of the resulting
connections for the two kernels, in the form of histograms. These show the number of connections
that lie within each range (bin) of critical paths displayed on the x-axis. The bins are distributed
logarithmically to show more detail in the lower lengths, where most of the connections lie.
As expected, the results show little variation between the two runs for the first kernel (L1048),
where all registers are reallocatable. The relative difference in average path length is 0.31%,
and the relative difference in interconnect usage is 1.6%. This means that most of the difference
seen in subsequent steps should be attributable to the level of freedom to reallocate. The
corresponding histogram (figure 4.55) shows very similar distributions, with very few connections
lying outside of the bell. The maximum connection length is 7.4ns.
For the second kernel (L917), without register renaming, a significant number of registers are
locked in a sub-optimal allocation. Register renaming should free these. The results show a
30% decrease in both average path length and interconnect utilisation. The connection length
histogram (figure 4.56) consistently shows more connections lying in most of the lower bins,
and a drastic reduction in the number of connections in all the higher bins.
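The percentages quoted for the two kernels can be recomputed from table 4.8. The following is only a small illustrative sketch: the function name `relative_decrease` and the dictionary layout are assumptions made for this example, and the numbers are copied directly from the table.

```python
def relative_decrease(disabled, enabled):
    """Relative reduction (in percent) when going from renaming disabled to enabled."""
    return (disabled - enabled) / disabled * 100.0


# Values copied from table 4.8: (renaming disabled, renaming enabled).
table_4_8 = {
    "L1048": {"utilisation": (16.63, 16.36), "delay_ns": (0.966, 0.963)},
    "L917": {"utilisation": (21.92, 15.39), "delay_ns": (1.55, 1.10)},
}

for kernel, stats in table_4_8.items():
    for metric, (disabled, enabled) in stats.items():
        change = relative_decrease(disabled, enabled)
        print(f"{kernel} {metric}: {change:.2f}% lower with renaming")
```

For L1048 both changes come out below 2%, and for L917 both are close to 30%, which matches the figures given in the text.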
If more kernels were processed without register renaming, the number of registers locked in
place would gradually increase as each successive kernel is processed. Therefore, the improvement
with register renaming should become increasingly apparent. Also, it should be noted that
this example was not pipelined. Enabling pipelining significantly increases the number of
connections involving registers, and as a result, register renaming should affect a larger percentage




This chapter looked at algorithms used in the process of converting a program described in a
high-level language into configuration contexts for any arbitrary RICA core. The program is
first compiled using a conventional industry-standard compiler (GCC) using a custom back-end
which specifies an instruction set matching the capabilities of the individual cells present
in the target array. The algorithms presented in this chapter are used to convert this assembly
into configuration contexts, reconstructing the parallelism inherent in the basic blocks of the
program, whilst adhering to the available resources.
The process of extracting parallelism from basic blocks involves inferring additional registers.
A series of algorithms were introduced that allow register lifetimes to be derived from the
assembly, and to work around register starvation by gradually reducing the amount of parallelism
until a valid schedule can be obtained. Register lifetime information was also used to
perform register reallocation to improve the mapping onto the reconfigurable fabric, reducing
the interconnect paths.
Data path machines are able to perform operations sequentially, in parallel, or combinatorially
(i.e. operation chaining). The data paths produced by the compiler often consist of many
operations chained together, which can lead to long critical paths after parallelisation. These
limit the achievable throughput. The next chapter looks at ways to significantly improve upon





Streaming applications such as real-time signal processing demand high throughputs, and are
becoming increasingly prevalent in low-cost embedded systems, such as mobile phones. To
meet the tough throughput and area requirements, ASICs have usually been the best solution
on a performance/cost basis. However, as the cost of ASIC design and manufacture ever
increases, and as customers demand more and more functionality, providing separate ASICs for
each set of streaming algorithms (whether discrete or as part of a larger SoC) becomes a costly
exercise. Many of these features are not used at the same time, so there is room for silicon re-use.
Furthermore, vendors often wish to differentiate their products by providing a different set of
algorithms for a particular feature compared to their competitors. A reprogrammable solution
addresses these problems by decreasing the NREs of developing the initial ASIC (by sharing it
amongst a wider audience), and allows different applications to use the same silicon at different
times.
As discussed in earlier chapters, there are two main families of reprogrammable solutions that
are able to meet these throughput requirements: SIMD architectures (such as modern GPUs),
and reconfigurable data path architectures (such as embedded FPGAs). SIMD architectures
achieve high throughputs by operating on several iterations of data at once (distributed across
several execution units), making up for the lack of performance of each unit (a microprocessor).
However, this is only possible on embarrassingly parallel [84] algorithms, where there are few
data dependencies (e.g. finite impulse response (FIR) filters). Data path architectures operate
on a smaller batch of data at once (often just a single iteration), but the latency of each iteration
is significantly lower, thus achieving a higher throughput per unit. Operating on a smaller batch
size means that data path architectures have a higher tolerance to infinite impulse response (IIR)
filters.
Coarse-grained reconfigurable data path architectures, such as instruction cell based processors
[5][6], are better suited to embedded systems than FPGAs, as the routing/computation area
ratio is lower, and the smaller configuration sizes reduce the program memory footprint. The
configuration size also allows these devices to be reconfigured much more rapidly, thus allowing
them to implement control flow similar to a microprocessor.
This chapter looks at ways to improve the throughput of streaming applications on coarse-grained
reconfigurable architectures that support a high degree of operation chaining. Performance
is optimised by attempting to match the size of each kernel (the inner loop where most
of the execution time is spent) to the available resources, allowing them to fit into a single
configuration. This allows the configuration to persist for many clock cycles, operating on new data
on each cycle. This increases throughput, since no time is spent having to reconfigure the core
between successive iterations. It also decreases power consumption, as the configuration only
Pipelining
Figure 5.1: Typical program running on a dynamically reconfigurable processor—the program 
consists of several steps running in sequence, in which some can loop back to them­
selves (i.e. are kernels). The kernel can be pipelined, increasing its throughput.
needs to be fetched from program memory (or cache) once, upon first entering the kernel,
rather than on every iteration. However, the resulting data paths can often have a long critical
path, leading to poor temporal utilisation of the functional units, since they have to wait until
all functional units have completed before operating on the next batch of data, which limits the
throughput.
Pipelining provides a way of starting to operate on a new batch of data before an old one has
completed. This allows the functional units of multiple stages of the kernel to be active
concurrently, each operating on a different batch of data. The technique allows complete kernels that
were mapped to a single configuration context to have their critical path length decreased by
the addition of pipeline stage registers, as illustrated in figure 5.1.
Section 5.3.1 describes an algorithm to perform pipeline stage allocation, based on a given
target critical path constraint. Section 5.4 shows how properties of dynamic reconfiguration can
be used to fill and flush the resulting pipeline. This is an entirely software approach. Section 5.5
proposes a second approach, which introduces some changes to the hardware, to allow filling
and flushing to be incorporated into the single kernel configuration context, thus significantly
reducing the program memory overhead, especially for very deep pipelines. Section 5.6 details
how the task can be completely automated, by first suggesting how to identify loops that can
be pipelined, and second by introducing an algorithm for finding the optimal target critical path
constraint. Section 5.7 shows how further improvements can be made by internally pipelining
the hardware of the instruction cells, thus reducing the minimum possible pipelined critical
path. Section 5.8 shows the result of applying these techniques to a real-life kernel used in
image processing.
5.0.0.1 Aims
•  Automatically pipeline compute-intensive loops to significantly increase throughput.
5.0.0.2 Objectives
•  Automatic pipeline stage assignment, based on a user-supplied target critical path
constraint.
•  Minimal hardware changes.
•  Minimising the impact of pipelining on the context configuration size.
•  Minimising the impact of pipelining on the overall program size.
•  Automating the choice of target critical path.
5.0.0.3 Novelty
•  Combining ASIC design techniques (structural pipelining) and rapid dynamic
reconfiguration to achieve automatic pipelining of kernels, in a manner similar to software
pipelining. Dynamic pipelining (section 5.3), using a Pipeline stage allocation algorithm
(section 5.3.1), and Automating the choice of timing constraint (section 5.6).
•  A method for removing the need for separate pipeline fill and flush contexts with minimal
addition of hardware: Single-step pipelining (section 5.5).
•  Support for pipelines involving internally pipelined cells (section 5.7).
Another way to look at the first item is that the structural pipelining allows a custom execution
unit pipeline to be obtained, that best matches the algorithm. Then conventional software
pipelining is used to partition the software (the instructions of that kernel) to fit that custom
pipeline. The second item is then how to implement hardware predication in a manner
that allows a single configuration context to perform all the phases of a software pipeline: fill,
loop, and flush. The last item is useful for further improving the throughput by hiding long
combinatorial operations or memory access latency.
5.1 Background: Structural Pipelining
Various approaches to pipelining data paths have been proposed [85, 86]. These require that
the designer specifies a throughput constraint, in order to allow the algorithm to best make
the choice between throughput and the area overhead each pipeline stage introduces. These
approaches describe various algorithms for the task of pipeline stage allocation, applied to
a number of different levels in a design, from high-level modules described at the behavioural
level, down to the operation-level [87]. This is possible due to the hierarchical nature of designs
described via hardware description languages (HDLs). In the context of computing, pipelining
can be applied at the structural-level, where data paths are defined between abstract building
blocks, which map to the functional units. In addition to constraints imposed by the designer,
[85] describes how the presence of feedback loops in a design limits the extent to which
pipelining is possible. An important part of the optimisation process is therefore to minimise the
presence of feedback loops. Techniques such as retiming [88] may be used for this purpose.
In an ASIC environment, pipelining is often done together with component selection, in order
to make the additional trade-off between the cost of higher performance components, and area
for equivalents made from pipelining lower-cost components [89]. This can also be seen in
an FPGA environment [90, 91], where the choice can be made between scarce special-purpose
resources or synthesising the functionality using configurable logic blocks, pipelined to achieve
sufficient throughput. In a computing environment, component selection could be viewed as
the choice of which instruction expansions and manipulations the compiler should perform in
order to best match the available functional unit resources, or in resource selection for custom
cores. The key difference in computing environments compared to ASIC/FPGA environments
is in the level of re-use of functional units, and the time scale over which they are re-used.
For architectures that support instruction chaining, scheduling involves mapping as many
dependent and independent data paths into as few configuration contexts as possible [57].
Independent data paths run in parallel, so the time for which a configuration persists is determined
by the maximum critical path length of these data paths. If sufficient functional unit resources
are available, loops can be optimised by loop unrolling [92], i.e. placing multiple iterations
as independent data paths in the same configuration. This allows multiple iterations to begin
and end at once. This does not change the original critical path length, yet can increase the
throughput.
The throughput is determined by the critical path length of a loop iteration and the number
of iterations that can be performed at once. During each execution of the loop configuration
context, data propagates through the operation chains until the final result is ready. This means
that the functional units involved in that chain are only performing useful work for a fraction
of the time. This is where loop pipelining techniques [7, 8, 9] come in, where the data paths
are structural-level pipelined to artificially reduce the critical path length by allowing new
iterations to begin without waiting for the completion of previous iterations. This can be thought
of as successive iterations of the loop being replicated in hardware, but offset from each other
to deal with the data dependencies between the iterations.
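The relationship between critical path, unrolling and pipelining can be illustrated with a small calculation. This is a simplified sketch, assuming that throughput is simply the number of iterations completed per execution of the loop context divided by the context's critical path; the function name and the example timings (20 ns, 5 ns, an unroll factor of 4) are hypothetical, and pipeline fill and flush overheads are ignored.

```python
def throughput(critical_path_ns, iterations_per_execution=1):
    """Iterations completed per nanosecond for one loop configuration context."""
    return iterations_per_execution / critical_path_ns


baseline = throughput(20.0)      # one iteration per 20 ns execution
unrolled = throughput(20.0, 4)   # four independent iterations side by side
pipelined = throughput(5.0)      # same loop split into stages of 5 ns each

print(baseline, unrolled, pipelined)
```

Under these assumptions, unrolling by four and pipelining into 5 ns stages both give four times the baseline throughput; unrolling pays for it with replicated functional units, whereas pipelining pays with stage registers.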
The same technique can be applied to SIMD architectures: the algorithm can be split into
pipeline stages, and each stage mapped to a different execution unit. This reduces the execution
time per iteration for each unit, which allows infinite impulse response filters to be accelerated¹.
This also reduces the code size for each unit, which can be useful in fitting more complex
algorithms into the available resources of each unit. This is used in large finite impulse response
filters, where the pipeline can be replicated, each operating on a different independent batch
of data. This increases the throughput up to similar levels as having each unit perform the
complete operation on a separate batch of data, but requires less code per execution unit.
5.1.1 Background: Software Pipelining
With increases in the number of functional units available in microprocessors and DSPs,
techniques for maximising the utilisation of these function units have been devised, and are referred
to as software pipelining [93, 32]. This involves rescheduling the instructions in a loop iteration
such that multiple iterations are in progress concurrently, each at a different level of
completion. This is done in such a way as to minimise the initiation interval, i.e. the rate at which
new iterations are begun, which directly determines the throughput. If the available functional
unit resources exceed the requirements of the operations in the loop, the loop may be unrolled.
Each unrolled iteration uses a different set of registers, which prevents certain dependencies
from conflicting with previous in-flight iterations. This allows the initiation interval to be
further reduced, and hence increases the throughput. Additional instructions must be added before
and after the software pipelined loop, in order to fill and to flush the pipeline. These are called
the prologue and epilogue, respectively. This is illustrated in figure 5.2.
Figure 5.2: Software pipelining: simple loop running on a processor with two functional units 
that can run in parallel, but are fed from a single instruction queue. The colours show 
which instructions could be executed together without causing a pipeline stall (bubble).
¹ Albeit only by the number of pipeline stages.
However, there is a limit to the depth of fixed pipelines that can be created without compromising
the performance over a wide range of different applications [94]. There is also a practical
limit on the degree of independent instruction-level parallelism that can be made use of [95],
which places a limit on the number of functional units that can be made available. This practical
limit can be avoided by allowing sequences of dependent operations to be chained together
and executed in a single cycle, i.e. dependent instruction-level parallelism. Such sequences
are statistically more common than there being sufficient independent data paths in a kernel to
make full use of a large number of independent functional units. This is an approach that is
being taken in some modern VLIWs/ULIWs [96, 31] and in the design of processors found in
highly multi-core fabrics [1].
The remainder of this chapter describes how dynamic reconfiguration and operation chaining
can be used to create custom hardware pipelines that best match a particular kernel's
requirements. Software pipelining techniques are then used to best map the kernel to this custom
pipeline. This is done as part of the configuration, i.e. pipelines tailored to the particular kernel
are rendered onto the core at run-time. This has the same effect as adding pipelining in
hardware, but can be changed at run-time.
5.2 Preconditions
An operation is something that creates a value in the data path. Operations map to physical
functional units in the core, or registers. Operations create values that are transferred through
the routing network to other functional units in the core. Receivers of these values are either
other operations, or global output registers, i.e. registers assigned by the compiler to make
a value available in a later basic block. Reading from a global input register (i.e. a register
assigned by the compiler to bring in a value from a previous basic block) is also counted as an
operation, although writing to a global output register is not. Instead, each operation specifies
which global output registers store its value (if any). This mirrors the information captured in
normal assembly notation.
This work assumes that the target architecture exhibits the following properties:
•  Sufficient functional units exist in the silicon to allow a kernel to be mapped into a single
configuration.
•  Operations can be chained together directly through the interconnect network, or via
registers. For simplicity, it is assumed that any direct connection can be replaced by a
register, although this restriction could be easily overcome.
•  Registers introduce a delay of one iteration into the data path, i.e. any value written to a
register first becomes available in the next cycle.
•  Arbitrary program flow control (branching) is supported, such that if the value of the
program counter is modified, the configuration at that address is loaded on the next cycle.




It is assumed that the kernel is intended to run for many iterations. Since the additional
configuration contexts introduced with pipelining, i.e. the prologue and epilogue (introduced in
section 5.4), have to be loaded and executed once each, irrespective of the number of required
iterations of the kernel, the total time spent configuring the core is likely to be higher when
pipelining is used. Therefore, the iteration count of the kernel must be high enough so that this
increase in configuration time is offset by the decrease in total execution time for the same
number of iterations of the kernel, resulting from the decrease in critical path length in the kernel
loop context. Furthermore, it is impossible for the pipelined design to perform fewer iterations
than the number of pipeline stages present. The minimum control flow path would be to
execute each of the prologue contexts in sequence, then the kernel loop context once, then each of
the epilogue contexts in sequence, during which there are as many in-flight kernel iterations
as there are pipeline stages (see figure 5.5).
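The break-even point described above can be estimated with a simple timing model. The sketch below is illustrative only and rests on assumptions that are not taken from the text: a pipeline of S stages is taken to need S - 1 prologue contexts and S - 1 epilogue contexts, every context is charged the same fixed load time, and each execution of a pipelined context takes one stage delay. All function names, parameters and numbers are hypothetical.

```python
def unpipelined_time(iterations, critical_path, config_time):
    """One context load, then one full critical path per iteration."""
    return config_time + iterations * critical_path


def pipelined_time(iterations, stages, stage_delay, config_time):
    """Prologue, kernel loop and epilogue under the assumed timing model."""
    if iterations < stages:
        raise ValueError("cannot run fewer iterations than pipeline stages")
    contexts = 2 * (stages - 1) + 1    # prologue + loop + epilogue contexts
    executions = iterations + stages - 1
    return contexts * config_time + executions * stage_delay


def break_even_iterations(stages, critical_path, stage_delay, config_time,
                          limit=1_000_000):
    """Smallest iteration count for which pipelining is strictly faster."""
    for n in range(stages, limit):
        if (pipelined_time(n, stages, stage_delay, config_time)
                < unpipelined_time(n, critical_path, config_time)):
            return n
    return None


# Hypothetical kernel: 20 ns critical path split into 4 stages of 5 ns,
# with 10 ns to load each configuration context.
print(break_even_iterations(stages=4, critical_path=20, stage_delay=5,
                            config_time=10))
```

With these example numbers the pipelined version only becomes faster from the sixth iteration onwards, which illustrates why the kernel needs a sufficiently high iteration count.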
The tool chain in the current implementation consists of a compiler that produces assembly
consisting of operations that map to the functionality of cells in the core, and a scheduler that
extracts the parallelism from the basic blocks of the assembly, creating the netlist that defines
each of the configuration contexts that can be loaded onto the target architecture. Pipelining is
performed on the netlist. Since this is done outside of the compiler, much of the information
available as part of the compiler's data model is no longer available by this stage. Therefore,
much of the configuration cannot sensibly be modified without unexpected adverse effects. This
limitation, although avoidable, aids in making the proposed technique more general.
For use as part of a re-targetable tool chain, it is desirable to have as few in-built assumptions
or special case logic as possible. The special case logic used in this work has been reduced to
just the following:
In addition to the dependencies implied by the connectivity between the function units relating
to the operations in the kernel, it is also necessary to capture other dependencies implied by the
original order of execution. These are dealt with by adding constraints, to preserve the original
temporal order. Examples include volatile operations (i.e. operations of a type that requires
the execution order of operations of that same type to be preserved) and potentially aliasing
read/write operations on the data memory.
The execution count of each operation remains unchanged after pipelining. This is a simple
way to ensure that side effects (e.g. state changes) of each operation remain unchanged, without
having to explicitly define which operations can have side effects. There are a few cases where
particular side effects require special treatment (e.g. the jump operation and registers).
The jump operation has an immediate side effect of causing the kernel to exit, and passes
execution onto the pipeline flushing contexts (epilogue, introduced in section 5.4). Therefore, the




An important side effect of registers transferring values from one cycle (and thus iteration) to
the next, is that any chain of operations that reads from a register, then writes back to the same
register (i.e. a feedback loop), must be placed in the same pipeline stage. Otherwise, it would
take more than one iteration to update the value of the register, thus causing the pipeline to
operate on garbage for some iterations. A feedback loop is also possible involving access to
data memory, which also introduces a delay of one cycle. However, use of data memory for
feedback is discouraged, since the latency cannot be hidden.
Since the critical path of the pipeline as a whole is dictated by the maximum of the pipeline
stage delays, and feedback chains (and the jump chain) must be placed into a single pipeline
stage, the length of such chains dictates the minimum possible overall critical path length,
and thus dictates the maximum possible throughput. Therefore, the compiler should ideally
perform optimisations that minimise the length of the feedback chains. The description of the
target architecture must include cost weighting for each functional unit, so that the compiler
can calculate the critical paths.
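The lower bound described above can be expressed in a few lines. This is a minimal sketch, assuming each feedback chain and the jump chain have already been reduced to a single delay figure; the function names and the example delays are hypothetical.

```python
def minimum_critical_path(feedback_chain_delays, jump_chain_delay):
    """Shortest achievable stage delay: no indivisible chain can be split."""
    return max([jump_chain_delay, *feedback_chain_delays])


def maximum_throughput(feedback_chain_delays, jump_chain_delay):
    """Upper bound on iterations per unit time implied by that lower bound."""
    return 1.0 / minimum_critical_path(feedback_chain_delays, jump_chain_delay)


# Hypothetical kernel with two feedback chains (1.2 ns and 3.5 ns)
# and a jump chain of 2.0 ns.
bound = minimum_critical_path([1.2, 3.5], 2.0)
print(bound, maximum_throughput([1.2, 3.5], 2.0))
```

In this example the 3.5 ns feedback chain sets the bound, so requesting a target critical path below 3.5 ns could not be satisfied however many pipeline stages are inserted.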
The user specifies the desired throughput target for the kernel basic block, by specifying the
desired critical path length of the kernel loop. This is done via special mark-up in the assembly.
The pipeline stage allocation algorithm uses this to determine where to insert pipeline stage
registers, and thus indirectly determines the number of pipeline stages to generate. Pipelining
is only applied at the request of the user, via the presence of the aforementioned mark-up,
since the iteration count is unknown in the current data model. For reasons of generality, no




5.3 Contribution: Dynamic Pipelining
Conventional structural-level pipelining can be applied to single configuration context kernels
with long critical data paths, in order to reduce the critical path, and thus increase throughput.
This is done as part of the configuration, i.e. pipelines tailored to the particular kernel are
rendered onto the core at runtime. This is done using existing register resources in the core to
delay values for a single execution cycle, allowing values to be bridged across pipeline stage
boundaries.
Structural pipelining is applied to the kernel basic block by first assigning each operation in the
original data flow graph to a pipeline stage. Then, registers are introduced to store values over
boundaries between pipeline stages. Only those values that are used in later pipeline stages are
stored. A new register is needed for each value for each pipeline stage boundary over which it
must persist. Figure 5.3 shows an example kernel before and after structural-level pipelining.
The example includes only simple feedback chains consisting of a simple increment of the
value of a register, however more complex feedback chains are also possible.
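The register cost of this rule can be sketched as follows. This is a minimal illustration, assuming pipeline stages are numbered from 0 and that a value needs one register for every stage boundary between its producer and its latest consumer; the function name, the dictionaries and the example graph are all hypothetical.

```python
def count_stage_registers(stage_of, consumers):
    """Count pipeline stage registers needed for a given stage assignment.

    stage_of:  {operation: pipeline stage index}
    consumers: {operation: [operations that read its value]}
    """
    total = 0
    for producer, users in consumers.items():
        if not users:
            continue
        # Number of stage boundaries the value must persist across.
        span = max(stage_of[u] for u in users) - stage_of[producer]
        total += max(span, 0)
    return total


# Hypothetical four-operation kernel spread over three stages.
stage_of = {"a": 0, "b": 0, "c": 1, "d": 2}
consumers = {"a": ["b", "d"], "b": ["c"], "c": ["d"], "d": []}
print(count_stage_registers(stage_of, consumers))
```

Here the value of `a` must cross two boundaries to reach `d`, while `b` and `c` each cross one, giving four registers in total. Values consumed only within their own stage need none.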
Figure 5.3: Example kernel data flow graph, (a) before pipelining, (b) after pipelining (kernel loop 
context). The inserted pipeline stage registers are shown in red. The per-cycle critical 
path is shown in bold, and is shorter in (b), which allows for a higher throughput.
5.3.1 Contribution: Pipeline Stage Allocation Algorithm
First, constraints are defined between operations, where the order of execution is important.
Examples include same stage or earlier constraints between operations reading from input
registers and operations that have those same registers marked as global output registers, and
same stage or earlier constraints between data memory read operations and potentially aliasing
data memory write operations. All operations in a feedback chain must be placed in the same
pipeline stage, since such chains require single-step total latency in order to keep the pipeline
full. The algorithm for assigning pipeline stages to each operation is as follows:
•  Identify the jump operation, and all of its dependencies. Save this in a set: the jump
chain set.
•  Create the remaining set: a record of those operations yet to be assigned to a pipeline
stage. This is initially populated with all the operations except for those in the jump chain
set.
•  Define the constraints:
    -  Add same stage or earlier constraints between operations reading from input
    registers, and operations that have those same registers marked as global output registers.
    -  Add same stage or earlier constraints between data memory read operations and
    potentially aliasing data memory write operations.
    -  Add same stage or earlier constraints between volatile operations of the same kind,
    to ensure that they still appear in their original order.
•  Detect feedback chains:
    -  Identify all the operations that are part of each feedback chain, and record them in
    a set for each chain. These shall be referred to as the feedback sets. No operation
    in a feedback set may be assigned to a pipeline stage until all the operations in that
    set are ready to be assigned.
•  Create an ordered list of pipeline stages, initially consisting of a single entry. Each entry
contains the set of operations that have been assigned to that pipeline stage.
•  For each operation in the remaining set:
    -  Create a temporary set containing this operation and any operations in the same
    feedback set (if one exists).
    -  Determine whether any of the operations in the temporary set have any successors
    that are also in the remaining set. If they do, then the temporary set is not ready, so
    discard it and move on to the next operation in the remaining set.
    -  Determine whether any constraints involving the operations in the temporary set
    involve operations that are also in the remaining set. If they do, then the temporary
    set is not ready, so discard it and move on to the next operation in the remaining set.
    -  Identify the latest pipeline stage where all the operations in the temporary set could
    be placed, according to their dependencies and constraints.
-  Construct a configuration context containing all the pipeline stages constructed thus
far, and calculate its critical path delay².
-  Speculatively construct a configuration context containing all the pipeline stages
constructed thus far, including the operations from the temporary set, placed in the
previously identified pipeline stage. Calculate its critical path delay.
-  If the critical path delay is different (i.e. increased), and the new delay exceeds
the target, then move to the preceding pipeline stage³. Long connections may have
delays many times greater than the target, in which case the insertion point is moved
back by several pipeline stages.
-  Transfer the operations from the temporary set to the identified pipeline stage, and
remove them from the remaining set.
-  Loop whilst the remaining set is not empty.
•  Add the operations from the jump chain set to the first pipeline stage.
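The steps above can be sketched in Python as a backward list scheduler. This is a simplified illustration under stated assumptions, not the thesis implementation: the function name, argument layout and delay model (a per-operation delay summed along chains within one stage) are hypothetical, feedback edges are assumed to be omitted from the successor map so that it is acyclic, feedback sets are assumed not to straddle the jump chain, a constraint only blocks the operation that must be "same stage or earlier", and an operation is moved back by at most one stage rather than by several.

```python
from collections import defaultdict

def assign_stages(ops, succs, delay, target,
                  feedback_sets=(), constraints=(), jump_chain=()):
    """Return {op: stage}; stage 0 is the first (earliest) pipeline stage.

    succs[op]     -> data-flow successors of op (feedback edges omitted)
    delay[op]     -> combinatorial delay of op (hypothetical delay model)
    constraints   -> pairs (a, b): a must be in the same stage as b, or earlier
    feedback_sets -> groups of operations that must share one stage
    """
    group = {op: frozenset([op]) for op in ops}
    for fs in feedback_sets:
        for op in fs:
            group[op] = frozenset(fs)
    not_after = defaultdict(set)
    for a, b in constraints:
        not_after[a].add(b)

    remaining = [op for op in ops if op not in jump_chain]
    depth = {}  # stages counted back from the last stage (0 = last stage)

    def path(op, members, d):
        # Longest delay chain from op through operations sharing depth d.
        nxt = [s for s in succs.get(op, ())
               if s in members or depth.get(s) == d]
        return delay[op] + max((path(s, members, d) for s in nxt), default=0)

    def cost(members, d):
        return max(path(m, members, d) for m in members)

    while remaining:
        progressed = False
        for op in list(remaining):
            if op not in remaining:
                continue  # already placed together with its feedback set
            members = group[op]
            blockers = set()
            for m in members:
                blockers |= set(succs.get(m, ())) | not_after[m]
            blockers -= members
            if any(b in remaining for b in blockers):
                continue  # not ready yet
            # Latest legal stage: the earliest stage holding any blocker.
            d = max((depth[b] for b in blockers if b in depth), default=0)
            # Move one stage earlier if that is what it takes to meet timing.
            if cost(members, d) > target and cost(members, d + 1) < cost(members, d):
                d += 1
            for m in members:
                depth[m] = d
                remaining.remove(m)
            progressed = True
        if not progressed:
            raise ValueError("cyclic dependencies or constraints")

    last = max(depth.values(), default=0)
    stages = {op: last - d for op, d in depth.items()}
    stages.update({op: 0 for op in jump_chain})  # jump chain goes in stage 0
    return stages
```

For a four-operation chain with unit delays and a target of 2, this sketch yields two stages of two operations each, with the jump placed in the first stage.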
The algorithm is a form of list scheduling. Only operations whose successors (in the data
path) have already been assigned a pipeline stage may be considered for insertion on each pass.
In order to minimise the register count, operations should be placed in as late a pipeline stage as
possible. Operations that must be placed in the same stage are dealt with together. Operations
are considered for placement in the earliest pipeline stage containing any of their successors.
Then, the insertion point is moved towards earlier pipeline stages until all constraints have been
satisfied. Once a valid insertion point has been identified, the critical path is calculated for the
resulting (incomplete) configuration context with the operation in that pipeline stage. If the
critical path meets the target value, the operation is placed in that pipeline stage. Otherwise, the
operation is added to an earlier pipeline stage, the gap (number of stages earlier) being equal
to the critical path divided by the target critical path. This allows for the interconnect itself to
be pipelined, where pipeline stages can contain just wires between pipeline stage registers.
The creation of dependencies ensures that the sequence of state changes is maintained, thus
ensuring correct results. Assigning operations to as late a pipeline stage as possible helps to reduce
the number of registers required. Once the pipeline stages have been determined, pipeline stage
registers are assigned as follows:
•  For each pipeline stage in sequence:
-  Assign a new register storing the value produced by each operation in all previous
pipeline stages that needs to be stored for use in this or any later stage.
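The register assignment rule can be illustrated with a short Python sketch. The function name and data layout are assumptions made for illustration; the sketch only lists which values need a pipeline stage register on entry to each stage, and does not model physical register allocation.

```python
def pipeline_registers(stage, succs):
    """Map each stage index t >= 1 to the operations whose values must be
    held in a pipeline stage register on entry to stage t.

    stage[op] -> pipeline stage of op (0 = first stage)
    succs[op] -> operations that consume the value produced by op
    """
    n_stages = max(stage.values()) + 1
    regs = {t: [] for t in range(1, n_stages)}
    for op, s in stage.items():
        # The value must survive until the stage of its last consumer.
        last_use = max((stage[c] for c in succs.get(op, ())), default=s)
        for t in range(s + 1, last_use + 1):
            regs[t].append(op)
    return regs
```

A value produced in stage 0 and last used in stage 2 therefore occupies one register at each of the two stage boundaries it crosses, which is why placing operations as late as possible reduces the register count.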
² Including the reading from and writing to pipeline registers.




The pipeline stage assignment algorithm described here cannot pipeline data flow graphs
containing large-scale feedback loops⁴ involving registers in the core, where the final result of one
iteration is involved in the calculation some iterations later. The data flow graph would show
the final result being fed through a chain of registers, back into the beginning of the graph. The
register feedback chain detection logic would see this entire chain as having to be in the
same pipeline stage. In reality, these operations do not need to be in the same stage; the feedback
registers could instead be re-used as pipeline stage registers. This re-use of existing registers as
pipeline stage registers is called re-timing, and is outside the scope of this thesis.
⁴ i.e. infinite impulse response filters.
5.4 Contribution: Multi-Step Pipelining
Normally, a pipelined design would require additional logic to take care of initialising the
pipeline stages, or to suppress the operations in later pipeline stages until the previous stages
have filled (predication), so that they do not operate on garbage. However, the pipelines in a
coarse-grained DRA are themselves rendered as part of the configuration context. Provided
that the configuration time is not significantly larger than the execution time of each step,
dynamic reconfiguration can be used to render different configurations before the main kernel loop
configuration, to fill successive stages of the pipeline, and similarly to flush the pipeline after
exiting the kernel loop. This allows the kernel loop configuration to assume that the pipeline
stages are always full. This provides a generic, purely software alternative to predication, which
can be used as a fall-back when no hardware support exists.
Fill: New configuration contexts are created to initially fill each successive stage of the pipeline.
For n pipeline stages, n - 1 pipeline filling contexts are created.
Loop: A single configuration context is created for the kernel loop, which includes all pipeline
stages.
Flush: New configuration contexts are created to flush successive stages of the pipeline. For n
pipeline stages, n - 1 pipeline flushing contexts are created.
The core is dynamically reconfigured to first perform pipeline initialisation, then reconfigured
to execute the kernel loop, then finally reconfigured to flush the pipeline, as demonstrated in
figure 5.5. This is similar to the prologue and epilogue in software pipelining [32].
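The sequence of contexts can be enumerated with a few lines of Python. This is an illustrative sketch only: the function name and tuple layout are hypothetical, and it assumes that "flushing stage i" means stages 1 to i are no longer active in that context.

```python
def multi_step_contexts(n_stages):
    """List the (kind, active_stages) contexts for an n-stage pipeline."""
    stages = list(range(1, n_stages + 1))
    # Fill context i activates stages 1..i (n - 1 contexts in total).
    fill = [("fill", stages[:i]) for i in range(1, n_stages)]
    # The kernel loop context contains every pipeline stage.
    loop = [("loop", stages)]
    # Flush context i activates stages i+1..n (n - 1 contexts in total).
    flush = [("flush", stages[i:]) for i in range(1, n_stages)]
    return fill + loop + flush
```

For a 3-stage pipeline this produces five contexts in total: two fill contexts, one loop context and two flush contexts, each fill and flush context being a subset of the loop context.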
Figure 5.4: The sequence of configuration contexts created for the example kernel, (a) iteration
1: filling pipeline stage 1, (b) iteration 2: filling pipeline stages 1 and 2, (c) iterations
3 to n - 2: pipeline full (loop), (d) iteration n - 1: flushing pipeline stage 1, (e) iteration










[Figure 5.5 diagram; surviving context labels, in order: Stage 1; Stages 1 and 2; Kernel Loop; Flush Stage 1; Flush Stage 2]
Figure 5.5: Control flow for a 3-stage pipelined kernel, showing which stages are active in each 
context (and moment in time). Execution flows from one context to the next, except in 
the kernel loop, which loops back to itself—holding the same context— until the end 
condition is satisfied.
The configuration contexts generated for the kernel example from figure 5.3 are shown in
figure 5.4. The use of separate special-purpose configurations alleviates the need for special logic
for this purpose in the kernel loop configuration context. Furthermore, there are no
hardware-enforced limitations on which pipeline stage each operation can be placed in⁵, thus maximising
the achievable pipeline depth.
Figure 5.5 shows which stages of the pipeline are active during execution for a 3-stage pipeline.
As the target architectures may not be state free (e.g. memory access), it is important not to
allow any operation in any pipeline stage to operate on garbage, and to preserve the execution
count. With the arrangement shown in the figure, all pipeline stages will be executed the same
number of times irrespective of the number of iterations performed in the kernel loop.
Now consider the original kernel, where the jump operation causes the loop to terminate after
n iterations. In the pipelined kernel, we must ensure that the kernel loop terminates after n
executions of the operations that calculate the loop termination condition; otherwise, the
operations or operands would need to be modified to yield a different iteration count. Looking
at figure 5.5, the minimum number of iterations possible in the pipelined design occurs when
the kernel loop context executes only once. This corresponds to an iteration count equal to the
number of pipeline stages (in this case 3). In order for the loop to terminate immediately, the
operations that determine the loop termination condition must have been executed this number
of times by the time the kernel loop context has been executed. This can only be achieved by
placing these operations in the first pipeline stage. The same argument also applies for any
higher iteration count. Figure 5.6 shows two examples, to highlight this point.
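The execution-count argument can be checked numerically. The following Python sketch is an illustration under assumptions (hypothetical function names; fill context i is taken to run stages 1 to i, and flush context i to run stages i+1 to n): it counts how often each stage has executed when the kernel loop context exits, and after flushing.

```python
def counts_at_loop_exit(n_stages, loop_passes):
    """Executions of each stage once the kernel loop context has finished."""
    counts = [0] * n_stages
    for i in range(1, n_stages):      # fill context i runs stages 1..i
        for s in range(i):
            counts[s] += 1
    for _ in range(loop_passes):      # the loop context runs every stage
        for s in range(n_stages):
            counts[s] += 1
    return counts

def counts_after_flush(n_stages, loop_passes):
    """Executions of each stage after the flush contexts have also run."""
    counts = counts_at_loop_exit(n_stages, loop_passes)
    for i in range(1, n_stages):      # flush context i runs stages i+1..n
        for s in range(i, n_stages):
            counts[s] += 1
    return counts
```

With 3 stages and a single pass through the loop context, the counts at loop exit are 3, 2 and 1, so only the first stage has already reached the final count of 3. This is consistent with the requirement that the termination condition be computed in the first stage.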
















Figure 5.6: Expanded control flow for the pipeline shown in figure 5.5 for (a) 3, and (b) 6 iterations.
The point at which the loop termination condition should evaluate to true is shown by 
a dotted box. It can be seen in both cases that only the first stage has executed for the 
desired number of iterations by this point.
Placing the jump in the first pipeline stage therefore requires that all of its dependencies are also
placed in the first pipeline stage. Since the pipeline filling contexts (prologue) should always be
executed in sequence (with no branching), the jump operation is omitted from these contexts,
even though it is in a pipeline stage active in those contexts. Its dependencies are left in place,
since their side effects are important, e.g. they could update the iteration counter whose value
is used to determine the loop termination condition.
5.4.0.2 Limitations
Although the creation of separate sequences of fill and flush configuration contexts makes for a
closer match to conventional software pipelining, giving maximum flexibility in which pipeline
stage to assign each operation to, it introduces a significant program memory overhead. For
example, in typical streaming applications where the programs consist of a few large kernels
with a few steps of glue code in between, pipelining the kernels can easily increase the step
count (and thus program size) by an order of magnitude. However, since each of these contexts
is a subset of the kernel loop context, they are good candidates for code compression. Even
with compression, there will be a trade-off between the throughput of the loop context and the




5.5 Contribution: Single-Step Pipelining
For very large cores, data paths in kernels with high cell utilisation can have very long critical
paths. It is these kernels that benefit most from pipelining, in terms of throughput. However,
when using the dynamic pipelining technique with multiple configuration contexts, described
in section 5.4, the overhead on the program memory of storing the series of fill and flush
configuration contexts can be prohibitive.
Each of the fill and flush contexts contains a subset of the pipeline stages in the kernel
loop context. In other words, each of these contexts contains a subset of the active cells and
connections in the kernel loop context, and no new information. As a result, they would seem
to be good candidates for compression. However, since the operations involved in each pipeline
stage can be arbitrary, it is difficult to partition the configuration stream in such a way as to
make the partitions align with the pipeline stages of any given set of kernels. This means that
real-time, hardware-based compression schemes may not be applicable.
This section explores modifications to the hardware that allow a kernel to be pipelined to an
arbitrary depth, whilst only requiring a single configuration context (or at most two configuration
contexts, as will be explained in section 5.5.2) to be stored in memory.
At the most basic level, the idea is to execute the same configuration context (the kernel loop)
for all three phases: fill, loop, and flush. As shown in figure 5.6, in order to maintain the original
behaviour, each pipeline stage must be executed the same number of times as the original loop
was to be executed. However, since each pipeline stage depends on the data produced by the
previous pipeline stage in the previous iteration, each pipeline stage must be delayed by one
iteration from the previous one. If one configuration context is to represent all of this, then this
context must perform more iterations than the original loop. In fact, it must perform n - 1
additional iterations (where n is the number of pipeline stages). In these additional iterations,
the operations of one or more pipeline stages will be operating on garbage.
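The iteration arithmetic can be made concrete with a small Python sketch. The function name and the 1-based numbering of stages and iterations are assumptions made for illustration; the sketch lists, for each stage, the iterations in which it would operate on garbage if nothing disabled it.

```python
def single_step_schedule(n_stages, loop_iterations):
    """Return (total_iterations, {stage: iterations operating on garbage})."""
    total = loop_iterations + n_stages - 1
    garbage = {}
    for stage in range(1, n_stages + 1):
        # Stage s receives valid data one iteration after stage s - 1 does.
        valid = set(range(stage, stage + loop_iterations))
        garbage[stage] = [i for i in range(1, total + 1) if i not in valid]
    return total, garbage
```

For a 3-stage pipeline and an original loop of 6 iterations, the single context runs 8 times and each stage sees garbage in exactly 2 (that is, n - 1) of them: the first stage at the end, the last stage at the start.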
To prevent the operations of each pipeline stage from being executed too many times, a
mechanism is needed to disable them (predication). The next four sections present different ideas that
gradually build up to a practical solution to this problem.
5.5.0.3 Idea 1
One simple way to prevent the cells from being executed too many times would be to associate
a hardware iteration counter with each cell, and initialise it (via the configuration) with a value
stating how many initial iterations during the pipeline fill phase it should be disabled for. This
also represents the number of initial iterations during the pipeline flush phase that it should be
enabled for (after which it would become disabled).
However, this approach introduces a substantial overhead in the configuration size, since each
cell would have an initial counter value associated with it.
Even if this were stored separately from the rest of the configuration stream, so as to impose
these additional configuration bits only on pipelined kernels, it would still be a significant overhead.
Furthermore, the number of bits in each of these fields imposes a maximum possible pipeline depth of
2^b - 1 (where b is the bit width of the field). Since the configuration size overhead scales linearly
with this bit width, there would be significant pressure to make this value as small as possible.
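As a rough illustration of this overhead, the following Python sketch computes the counter width needed for a given maximum pipeline depth and the resulting number of extra configuration bits. The function name and the cell count used in the usage note are hypothetical and are not figures from the thesis.

```python
def idea1_overhead_bits(num_cells, max_depth):
    """Return (bits_per_cell, total_extra_bits) for per-cell counters.

    A b-bit field can hold values up to 2**b - 1, so the smallest usable
    width for a given depth is the bit length of that depth.
    """
    bits_per_cell = max(1, max_depth.bit_length())
    return bits_per_cell, num_cells * bits_per_cell
```

Under these assumptions, a core of 1000 cells would need 3000 extra bits to support a depth of up to 7, and 4000 extra bits as soon as a depth of 8 is required.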
5.5.0.4 Idea 2
The first observation is that only operations that have internal state can affect the overall
behaviour of the program if they operate on garbage. N.B. purely combinatorial operations that
are executed more times than necessary (operating on garbage in these additional iterations)
will have no effect on the state of the system, and thus have no effect on program behaviour;
they merely introduce a small increase in power consumption. Therefore, it is only necessary
to control the iteration count of cells that maintain internal state.
The cells that have internal state generally fall into two categories: registers and hardware
I/O. Hardware I/O cells act as interfaces to external hardware, and these are few in number in
a typical core. On the other hand, register cells represent a significant fraction of the cells in a
typical core, especially a large core for use with pipelining. As a result, this first observation
can be used to reduce this configuration overhead by perhaps 50% for a typical core, by only
providing initial values for the iteration counters of cells that maintain internal state.
5.5.0.5 Idea 3
The effect on a register cell of operating on garbage is to replace its internal value with garbage.
Therefore, disabling a register cell simply prevents it from replacing its internal value with the
value currently read from its input port. The astute reader may correctly point out that the
majority of registers in a typical large core are used as pipeline stage registers, to bridge values
from one pipeline stage to the next. Having a large number of them increases their chance of
being close to where they are needed in the pipelined data paths, thus decreasing the critical
paths.
In the case of pipeline registers, the initial value stored in the register is of no consequence
during iterations when that pipeline stage is still awaiting data during pipeline filling, and
similarly, the final value stored in a register is of no consequence after that stage has completed the
correct number of iterations.
So, one solution would be to introduce a separate class of register cell specifically for
pipelining. Such cells would not have a counter associated with them. However, this poses a significant
problem: we cannot sensibly choose in advance which registers should be expressly for
pipelining, and which should be normal registers. Any mismatch between the shape of the pipelined
data paths and the core would result in additional path lengths, which may severely impact
performance, and require more pipeline stages to be created, making the problem worse. This is
the same problem that makes compression difficult.
5.5.0.6 Idea 4: A Practical Solution
The second observation is that out of the cells that have internal state, hardware I/O cells are
normally disjoint: their output does not depend on their input in the current iteration. In fact,
input to the cell and output from the cell tend to (by design) be distinct, independent data
streams within that kernel. The result of this is that the output side of an I/O cell tends to act as
a data source bringing data into the kernel, and thus appears in a very early pipeline stage; and
the input side of an I/O cell tends to act as a data sink bringing data out of the kernel, and thus
appears in a very late pipeline stage. Generally, moving the output side of all hardware I/O cells
active in the kernel into the first pipeline stage, and moving the input side of all hardware I/O
cells active in the kernel into the last pipeline stage, simply introduces a slight increase in the
number of pipeline stage registers needed.
As a result, we can (in effect) hardwire the counter associated with the output side of the
hardware I/O cells to 0, and the input side to n - 1. This obviates the need to store the initial
value in the configuration stream.
A third observation is that registers that are both read from and written to in the kernel (termed
kernel registers) mostly act as counters, and thus appear in early pipeline stages. Typically, their
stored value is used as a data source to other operations further down the critical path of the
kernel, and their new value is simply the current value incremented (or some similar operation
with a short critical path). We can therefore move most of these kernel registers into the first
pipeline stage, without affecting the ability to pipeline.
With all registers appearing in the first pipeline stage, the counter value associated with them
will always be initialised to 0, and as a result, there is again no need to store it in the
configuration stream. This completely eliminates the need to store additional data in the configuration
stream.
However, there is a problem: some kernel registers are not used as simple counters, and
instead are written to from further down the critical path of the kernel. This often occurs when
partial results from previous iterations are to be re-used, instead of re-calculated. Forcing all
such registers into the first pipeline stage severely limits the extent to which the kernel can be
pipelined, often preventing it from being pipelined at all.
A further observation is that in almost all cases identified, these kernel registers do not bring
data into the kernel from previous steps. Therefore, the initial value that they store is not
important, and can be safely overwritten with garbage during pipeline stage filling, without
affecting program behaviour. As a result, these can be safely placed in any pipeline stage.
In the very few cases where the initial value of kernel registers is important, some additional
instruction cells can be used to preserve the initial value, as will be described in section 5.5.2.
5.5.1 Contribution: Hardware Modifications For Single-Step Pipelining
Recall that one of the distinguishing features of the target architecture is that it is in control
of its own reconfiguration, i.e. the state of the machine and its data paths determine which
configuration context to load and when. The jump cell is responsible for reading, from the data
paths of the current configuration context, both the next jump target and the condition for when this
target should be loaded into the program counter.
[Figure 5.7 waveform; signals shown: program counter, jump::in_addr, jump::in_cond]
Figure 5.7: Internal control signals during execution of a normal (non-pipelined) kernel (step
index 3), within an outer loop (step indexes 2, 3, 4, and 5). jump::in_addr and
jump::in_cond are read from the reconfigurable data paths. The program counter
is internal to the jump cell, and shows the index of the configuration context currently
loaded in the core. The values shown for the data paths are those once they stabilise
near the end of the current iteration.
When executing a kernel (i.e. a configuration that loops to itself), the jump target is the kernel
itself. The data paths and state are constructed such that the condition causes the kernel
configuration context to loop back to itself until the appropriate iteration count has been completed,
after which control passes to the next step in sequence. Figure 5.7 shows the state of the signals
involved during this. In this example, a kernel (step index 3) executes for 6 iterations, inside
each iteration of the outer loop (step indices 2, 3, 4, 5). During execution of the 6th iteration of
the kernel, the data paths driving the in_cond input of the jump cell stabilise to zero,
indicating that the jump back to the kernel should not be performed, and as a result control passes to
the next step (index 4).
In order to support single-step pipelining, the main addition to this is the pipeline depth counter
(PDC). This hardware counter is initialised to a value given in the configuration, equal to the
number of pipeline stages in the pipelined kernel. The purpose of this counter is as follows:
•  To delay switching to the next configuration context, to allow pipeline flushing to take
place.
•  To disable during pipeline filling, those cells with internal state that must be in the last
pipeline stage.
•  To disable during pipeline flushing, those cells with internal state that must be in the first
pipeline stage.
The jump cell is further extended to contain a new 2-bit state representing the current execution
phase, which can be one of the following:
Normal: Normal (non-pipelined) step execution.
Filling: Filling the pipeline stages o f a pipelined kernel.
Looping: Executing the loop o f a pipelined kernel.
Flushing: Flushing the pipeline stages o f a pipelined kernel.
Upon entering the pipelined kernel⁶, the execution phase is set to filling, and the PDC is
initialised to the number of pipeline stages minus one. The PDC is decremented by one after each
iteration of the kernel, until it reaches zero. When the PDC reaches zero, the execution phase
shifts to looping. When the jump cell's inputs signal the end of the kernel, the execution phase
shifts to flushing, and the PDC is set to the pipeline depth minus one again. The current
configuration context continues to execute (i.e. the jump is delayed) whilst in the flushing phase. The
PDC is again decremented after each iteration, and upon reaching zero, the execution phase
shifts to normal, and the next configuration in sequence is executed.
Figure 5.8: Internal control signals during execution of the kernel from figure 5.7, this time (single- 
step) pipelined into 3 stages. The two new signals are also internal to the jump cell.
This mechanism increases the iteration count for the kernel by n - 1 (where n is the number of
pipeline stages), and produces signals to indicate when filling and flushing. Figure 5.8 shows
the same kernel and outer loop as in figure 5.7, but the kernel has been pipelined into 3 stages.
The figure shows the normal execution outside of the kernel (the Normal execution phase),
and the sequence of Fill, Loop, and Flush inside the kernel. After six iterations (the desired
duration of the inner loop) the data paths driving the jump cell's inputs indicate that it is time
for the kernel to end, so the jump condition goes low. However, this time another two iterations
of the kernel are performed, to allow the pipeline stages to be flushed, before allowing control
to pass to the next step (index 4).
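The phase transitions described above can be modelled as a small state machine. The Python sketch below is a behavioural illustration only, not a description of the actual hardware: the function name is hypothetical, the jump condition is assumed to go low in the iteration equal to the original loop count, and the original loop count is assumed to be at least the number of pipeline stages.

```python
def run_pipelined_kernel(n_stages, loop_iterations):
    """Return the execution phase of each iteration of a pipelined kernel."""
    assert n_stages >= 2 and loop_iterations >= n_stages
    phase, pdc = "filling", n_stages - 1
    trace = []
    iteration = 0
    while phase != "normal":
        iteration += 1
        trace.append(phase)
        if phase in ("filling", "flushing"):
            pdc -= 1                      # PDC decrements after each iteration
            if pdc == 0:
                phase = "looping" if phase == "filling" else "normal"
        elif iteration == loop_iterations:
            # Jump condition goes low: delay the jump and start flushing.
            phase, pdc = "flushing", n_stages - 1
    return trace
```

For the 3-stage, 6-iteration example, the trace contains two filling, four looping and two flushing iterations, i.e. 8 iterations in total, matching the n - 1 additional iterations stated in the text.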
The cells which maintain internal state, which must either be in the first or last pipeline stage,
can monitor the current execution phase to determine when they should be disabled. Note
that many cells are disjoint, where the input and output sides are independent. In these cases,
the input side is in the last pipeline stage so disables itself during the filling execution phase,
whereas the output side is in the first pipeline stage and so disables itself during the flushing
execution phase.
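The enable rule for stateful cells can be sketched in Python as follows. The names are hypothetical, and the phase sequence is assumed to be n - 1 filling iterations, then looping, then n - 1 flushing iterations, for an original loop of N iterations with N at least n.

```python
def cell_enabled(position, phase):
    """Whether a stateful cell in the 'first' or 'last' stage is enabled."""
    if position == "first":
        return phase != "flushing"   # first-stage cells are disabled while flushing
    if position == "last":
        return phase != "filling"    # last-stage cells are disabled while filling
    raise ValueError("stateful cells must be in the first or last stage")

def enabled_iterations(position, n_stages, loop_iterations):
    """Count the iterations in which such a cell actually updates its state."""
    phases = (["filling"] * (n_stages - 1)
              + ["looping"] * (loop_iterations - n_stages + 1)
              + ["flushing"] * (n_stages - 1))
    return sum(cell_enabled(position, p) for p in phases)
```

In the 3-stage, 6-iteration example, both first-stage and last-stage cells are enabled for exactly 6 of the 8 iterations, so their state changes as often as in the original, non-pipelined loop.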
⁶ Indicated by a non-zero initial PDC value in that configuration context.
5.5.2 Contribution: Software Modifications For Single-Step Pipelining
The pipelining algorithm remains largely unchanged; the difference lies mainly in the
preparation of the kernel data flow graph. The DFG edges representing operations which have internal
state have to be reserved for insertion into the first or last pipeline stage, as appropriate.
To prevent these from blocking the pipelining algorithm, all predecessors of the DFG edges
reserved for the first pipeline stage must also be reserved for the first pipeline stage. So too
must all of their predecessors, and so on (recursively).
Similarly, to avoid DFG edges not being assigned a pipeline stage, all successors of the DFG
edges reserved for the last pipeline stage must also be reserved for the last pipeline stage. So
too must all of their successors, and so on (recursively).
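The recursive reservation amounts to two transitive closures over the data flow graph. The following Python sketch illustrates this with hypothetical names, and for simplicity treats operations as graph nodes keyed by name; it is not the thesis implementation.

```python
def transitive(seeds, neighbours):
    """All nodes reachable from the seeds by following the neighbour map."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reserve_stages(succs, first_seeds, last_seeds):
    """Return (reserved_for_first_stage, reserved_for_last_stage)."""
    preds = {}
    for op, outs in succs.items():
        for out in outs:
            preds.setdefault(out, []).append(op)
    first = transitive(first_seeds, preds)   # seeds plus all their predecessors
    last = transitive(last_seeds, succs)     # seeds plus all their successors
    return first, last
```

Any operation left outside both sets remains free to be placed by the ordinary pipelining algorithm.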
The pipeline stage assignment process begins by creating the first pipeline stage using the DFG
edges that were reserved for the first pipeline stage. Then the pipelining algorithm is used to
determine which DFG edges can be added to the current pipeline stage, and which require a
new pipeline stage to be created in order to satisfy the timing constraint (and other constraints).
Finally, the edges reserved for the last pipeline stage are added to the last pipeline stage that
was created, unless doing so would violate the timing constraint, in which case a new pipeline
stage is created for them.
As before, the timing constraint is adjusted if it is less than the critical path of any of the
non-pipelineable data paths, which include the edges reserved for the first pipeline stage, and those
reserved for the last pipeline stage.
5.5.2.1 Registers
Recall that registers appearing in the original kernel data flow graph, that are both read from
and written to in the kernel, are placed in the first pipeline stage if possible. This is to ensure
that their initial value (upon entering the kernel) is not corrupted during pipeline stage filling.
However, if their final value (upon exiting the kernel) is important, placing the operation in the
first pipeline stage results in the final value being corrupted during pipeline stage flushing. Also
recall that the operations involved in updating the register's value constitute feedback, and must
all be in the same pipeline stage, otherwise it would take more than one iteration for the new
value to propagate through, leading to corruption. Therefore, the situation cannot be resolved
by placing the input side of the register in the last pipeline stage.
The special-case solution to this problem for registers (i.e. preserving their final value upon
exiting the kernel) is to duplicate the register so that the new value is written to the original
register and another register at once. A chain of pipeline stage registers is then used to bring that
duplicate value into the last pipeline stage, from where it will survive pipeline stage flushing.
A new step is created after the kernel, to copy this final value back to the original register.
By convention, this step is labelled the same as the kernel, but with the '_finalise' suffix









[Figure 5.9 diagram; surviving labels: Step L2_finalise, stage 3]
Figure 5.9: Construct used to preserve the final values of kernel registers upon exit from the
kernel. This is used only where necessary. The kernel register (V6) is highlighted red. r1
(highlighted in green) is the register used to bring the final value out of the kernel. The
required pipeline stage registers are shown with a dotted outline.
For k kernel registers, up to k × n additional registers would be required (where n is the
number of pipeline stages). To minimise this overhead, the live register identification algorithm
described in section 4.7.1 (on page 77) is used to determine which registers are actually live on
entry to the next block after the kernel, so that only these need to have a delay chain created for
them. In many cases, the final values of all the kernel registers are found to be irrelevant, so no
delay chains and no finalise step are created.
As discussed at the beginning of this section, some kernel registers are written to from
operations further down the critical path of the original kernel. If these are placed in the first pipeline
stage, then the maximum possible pipeline depth is severely limited. The live register
information is again used to identify which of these registers have an initial value that is dead on entry to
the kernel, which therefore allows them to be placed in any pipeline stage.
However, in a very small number of cases experienced, the initial value of some of these kernel
registers is live. This prevents an adequate pipeline from being formed. A solution was devised
to avoid this: additional logic is added to the chain of operations that supply the input of the
kernel register. This additional logic consists of a multiplexer cell to choose between the newly












Figure 5.10: Construct for supplying the initial value to kernel registers on entry to the kernel, 
allowing them to appear in any pipeline stage. This is only required in rare cases. 
The kernel register is highlighted red. Pipeline stage registers are shown with a dotted 
outline. Register r2 (highlighted in green) is used as a counter to drive the mux select 
signal, and must be reset to zero in all steps that pass control to the kernel (L3 in this 
example). This allows the kernel register to maintain its initial value until the first 
new value is ready.
multiplexer is then driven by a counter constructed out of another register and an adder cell,
along with a comparator. The whole construct is shown in figure 5.10. The counter is used to
make the mux hold the kernel register at its initial value during the pipeline filling iterations
prior to the stage where the kernel register appears, and then switch to the newly calculated
value thereafter. The counter is placed in the first pipeline stage, but directly drives the mux,
despite the mux being in a later pipeline stage. Special logic in the pipelining algorithm is used
to avoid inserting pipeline stage registers in this case.
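The behaviour of this construct can be illustrated with a simple Python model. This is a behavioural sketch under assumptions, not a description of the actual cells: the function name is hypothetical, stages are numbered from 1, the counter is assumed to be reset to zero before the kernel is entered, and the comparator is assumed to select the new value once the counter reaches the stage index minus one.

```python
def kernel_register_trace(initial, stage, iterations, update):
    """Value held by a kernel register in pipeline stage `stage` after each
    iteration, when guarded by the counter-driven multiplexer."""
    value = initial
    counter = 0                            # reset to zero before the kernel
    trace = []
    for _ in range(iterations):
        select_new = counter >= stage - 1  # comparator output drives the mux
        if select_new:
            value = update(value)          # take the newly calculated value
        # otherwise the mux feeds the old value back, preserving it
        counter += 1                       # register plus adder form the counter
        trace.append(value)
    return trace
```

For a register placed in stage 3 with initial value 10 and an increment as its update, the value stays at 10 for the first two (filling) iterations and then advances normally.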
5.6 Contribution: Automating The Choice of Timing Constraint
The previous sections in this chapter proposed a technique where pipelining would be performed
based on a critical path constraint provided by the application developer. This section
elaborates on this technique, by proposing how to automate the choice of critical path
constraint, in order to maximise the real-life throughput whilst minimising resource usage.
The arbitrary operation chaining supported by the target architectures leads to a great variation
in critical path length in different configuration contexts, as paths can be constructed involving
long chains of a varying number of cells, and each type of cell has a different combinatorial
delay. Ideally, each iteration of the configuration context should be allowed to persist for the
time required for the results to stabilise on the operation(s) that lie at the end of the critical path.
In order to avoid the overhead of asynchronous logic, a master clock is normally used instead,
and the iteration ends on the next master clock cycle after the last results have stabilised, as
can be seen in figure 5.11. To minimise the resulting idle time between these two events, it is
desirable to minimise the period of the master clock. However, high clock frequencies come at
the cost of power consumption and area. Therefore, a suitable compromise has to be made.
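The quantisation effect described above can be made concrete with a short calculation. The sketch below assumes that an iteration always lasts a whole number of master clock periods; the function names and the small floating-point tolerance are illustrative choices.

```python
import math


def iteration_time_ns(critical_path_ns, clock_period_ns):
    """Round the critical path up to the next master clock edge."""
    # The tolerance keeps exact multiples (e.g. 5.0 / 1.0) from rounding up an extra cycle.
    cycles = math.ceil(critical_path_ns / clock_period_ns - 1e-9)
    return cycles * clock_period_ns


def idle_time_ns(critical_path_ns, clock_period_ns):
    """Time spent waiting for the clock edge after the results have stabilised."""
    return iteration_time_ns(critical_path_ns, clock_period_ns) - critical_path_ns


# A 4.25 ns critical path with a 1 ns master clock occupies 5 ns per iteration.
print(iteration_time_ns(4.25, 1.0), idle_time_ns(4.25, 1.0))
```

The idle time is bounded by one clock period, so its relative cost grows as the critical path shrinks, which is why it matters most for deeply pipelined kernels.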
Figure 5.11: Idle time resulting from the master clock. The shorter the critical path of the kernel, 
the more effect this has. This particularly affects pipelined kernels.
Since pipelining reduces the critical path length of each iteration of the kernel loop
configuration context, the quantisation introduced by the master clock frequency affects pipelined
contexts more. Therefore, it is important to minimise the wasted time between the critical path




The timing constraint is initially chosen to be the minimum possible critical path length that a
pipeline stage can consist of. This is determined by the length of certain data paths that cannot
be split across pipeline stages. These include the jump condition logic determining when to
finish the loop, and feedback loops that update a register or memory location^7. The one with
the longest critical path length is selected, and the value rounded up to the next integer multiple
of the master clock period.
Then, pipeline stage allocation is performed using this critical path constraint. If a valid pipeline
could be constructed, register allocation is performed. If there are sufficient registers available,
then this pipeline geometry is used, since it will result in the highest possible iteration rate.
Otherwise, the timing constraint is incremented by one master clock period, and the process
continues. A natural end point exists where this value reaches the critical path of the
non-pipelined kernel. If reached, the context is left non-pipelined.
A linear search of possible target critical path constraints could be quite slow, so an optimisation
is to perform a binary search instead. The initial search space is that bounded by the minimum
possible pipelined critical path and the original (non-pipelined) critical path. A binary search
consists of performing multiple passes of splitting the current search space in half^8, choosing
the upper half as the search space in the next pass if pipelining succeeds with that target, or the
lower half otherwise. The search stops once the search space falls below the granularity of the
master clock. A record is kept of the shortest target critical path constraint that resulted in a
valid pipeline, so that this can be restored once the search ends.
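The search can be sketched as follows. This is a minimal illustration rather than the thesis implementation: `try_pipeline` is a hypothetical callback standing in for pipeline stage allocation followed by register allocation, targets are expressed in whole master clock cycles, and success is assumed to be monotonic in the target. After a success the sketch moves toward shorter targets, since the stated goal is the shortest constraint that yields a valid pipeline.

```python
def choose_timing_constraint(min_cycles, max_cycles, try_pipeline):
    """Binary search for the shortest target critical path, in clock cycles.

    min_cycles:   longest non-splittable data path, rounded up to whole cycles.
    max_cycles:   critical path of the non-pipelined kernel, in whole cycles.
    try_pipeline: hypothetical callback, returns True if a valid pipeline
                  (with enough registers) exists for the given target.
    Returns the best target found, or None to leave the context non-pipelined.
    """
    best = None
    low, high = min_cycles, max_cycles - 1   # max_cycles itself means "no pipelining"
    while low <= high:
        mid = (low + high) // 2
        if try_pipeline(mid):
            best = mid          # record the shortest valid target seen so far
            high = mid - 1      # look for an even shorter critical path
        else:
            low = mid + 1       # this target is too aggressive
    return best


# Example: pretend that any target of 7 cycles or more can be pipelined.
print(choose_timing_constraint(5, 54, lambda cycles: cycles >= 7))
```

Working in whole clock cycles makes the stopping condition automatic: the loop ends once the remaining search space is smaller than one master clock period.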
For completely automatic pipelining, static analysis is used to identify configuration contexts
that loop back to themselves (kernels). These are potential candidates for pipelining. The
minimum number of consecutive iterations for a kernel defines the maximum depth to which it can
be pipelined: the pipeline depth must not exceed the minimum execution count. This is used
as a test during each iteration of the timing constraint selection algorithm, where a potential
pipeline is checked for its depth not exceeding the minimum iteration count. If it does, then
the geometry is considered invalid, and the algorithm continues with a larger timing constraint.
Furthermore, for the multi-step pipelining technique (described in section 5.4), to take into
account the cost of loading the new configurations from memory, the minimum iteration count
value is artificially reduced by an arbitrary count, to weight the algorithm in favour of only
pipelining loops with significant iteration counts.
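The depth test can be written as a single predicate. In this sketch the name `load_margin` for the arbitrary reduction applied under multi-step pipelining is hypothetical, and its default of zero corresponds to single-step pipelining.

```python
def pipeline_depth_is_valid(depth, min_iterations, load_margin=0):
    """Check a candidate pipeline geometry against the kernel's iteration count.

    depth:          number of pipeline stages in the candidate geometry.
    min_iterations: minimum number of consecutive kernel iterations.
    load_margin:    hypothetical count subtracted for multi-step pipelining,
                    to account for the cost of loading extra configurations.
    """
    effective_min = max(min_iterations - load_margin, 0)
    return depth <= effective_min


# A 4 stage pipeline fits an 8 iteration kernel, unless a margin of 5 is applied.
print(pipeline_depth_is_valid(4, 8), pipeline_depth_is_valid(4, 8, load_margin=5))
```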
Static analysis or feedback-directed optimisation is used to determine the minimum possible
iteration count for each kernel. Feedback-directed optimisation is used only if static analysis
identifies that the iteration count is variable (and thus not statically analysable). For
feedback-directed optimisation, the program is first executed in the emulator (chapter 3) prior to
pipelining, and profiling information is fed back into the compiler. The number of consecutive
iterations of each candidate is determined through the profiling results.
^7 where that register or memory location is both read from and written to in the same kernel.
^8 using that value as the next target critical path constraint.
5.7 Contribution: Support for Internally Pipelined Cells
The ability to pipeline complex kernels using the algorithm and techniques presented earlier in
this chapter means that in many cases, the critical path of the pipelined kernel is that of reading
from a (pipeline) register, passing combinatorially through one or two functional cells, then
writing to another (pipeline) register. This reduction in critical path is what leads to a
substantial increase in throughput, but also means that the combinatorial delay of each cell accounts
for a much larger relative contribution to the critical path, and thus also the throughput. The
difference between the combinatorial delay of each type of cell therefore becomes increasingly
significant. Cells with large combinatorial delays, such as dividers, multipliers, and random
access to memory, often cause a bottleneck in the step.
Figure 5.12: A divider cell internally pipelined to 4 stages. For minimum critical path
contribution, reading from the inputs of the cell constitutes one pipeline stage (stage 1), and
writing to the outputs of the cell constitutes one pipeline stage (stage 4).
The solution to this is to break up the internal logic in these cells into pipeline stages, as
illustrated in figure 5.12. Unlike the dynamic pipelining discussed earlier in the chapter, this
internal pipeline is fixed in hardware at design-time for the array, and cannot be altered.
Once internally pipelined, a cell can no longer be combinatorial: the result for the current
iteration is delayed by several iterations^9. So configuration contexts involving these cells have
to be modified to take account of this. Both scheduling and pipelining are modified accordingly.
Since each cell type can support several configurations which correspond to different
operations^10, and these operations can be of different complexity, each operation should be allowed
to be pipelined to a different depth.
^9 the number of pipeline stages minus one.
^10 each represented by a different instruction.
5.7.1 Contribution: Scheduling Internally Pipelined Cells
When passing values into an internally pipelined instruction, the corresponding result does not
appear at the output of the cell until several iterations later. The scheduling algorithm and
data flow graph (DFG) data model that it operates on, as introduced in section 4.5 on page 67,
assume that all operations (except registers) are combinatorial.
There are two ways to ensure that the operations that depend on the result of the internally
pipelined instruction (with m pipeline stages) appear in the correct iteration:
•  Split the basic block into multiple steps, making sure that the successors of the internally
pipelined instruction appear m steps later than where the inputs were written to the cell.
•  Create a single step from the basic block and pipeline that step, making sure that the
successors of the internally pipelined instruction appear m stages later than where the
inputs were written to the cell.
The data flow graph (DFG) data model is altered as follows: Instructions are now split into
slots, each slot corresponding to an internal pipeline stage. Where previously each available
instruction cell resource in the core was subdivided into available instruction slots^11, now each
instruction slot is further subdivided into pipeline stage slots. Any or all of the pipeline stage
slots for an instruction slot can be occupied in each step.
An example of this is shown in figure 5.13. The figure shows a hypothetical cell for accessing
the contents of a stream memory (line buffer). The cell can be set up to read successive entries,
one per iteration, automatically incrementing the address each time. To hide the memory
latency, the memory access is internally pipelined (fixed in hardware). Since the address can be
predicted beforehand and is not read from the data path in the same step, there is no need to
expose the internal pipeline to the reconfigurable data paths: the cell can still be combinatorial (it
has only an output). In this mode, the STREAM_SET_ADDRESS instruction sets the beginning
address to start reading from, and begins fetching the first few values before the step ends. The
next step in the program (usually a kernel) would then use the STREAM_READ instruction to
fetch subsequent values from the stream, one per iteration. To be able to perform random access
reads on the contents of the stream, one per iteration, the internal pipeline has to be exposed to
the core. This is because the address cannot be predicted beforehand: it has to be read from
the reconfigurable data paths. Therefore, the cell has to have both an input (the address) and an
output (the result from an earlier iteration) active in the same step. The operations of setting
the start address, sequential reading, and random reading are mutually exclusive: they cannot
occur in the same step. However, the 3 stage internal pipeline of the random read instruction
allows for up to 3 random read operations to be in-flight at once (if it appears in a pipelined
kernel, as shown in figure 5.13(f)).
So, where previously an instruction would have a single DFG edge associated to it (like that
shown in figure 5.14(a)), an internally pipelined instruction is represented by multiple DFG
edges, one for each internal pipeline slot of that instruction (as shown in figure 5.14(b)). Only
the first of these DFG edges has inputs, which correspond to the inputs of the instruction. Only
^11 each corresponding to a configuration representing a different type of instruction supported by that cell, where


























[Figure 5.13 slot labels: stream_set_addr, stream_read, random_read]
Figure 5.13: Instruction slots corresponding to an instance of a cell that supports 3 different types
of instruction, one of which is internally pipelined. Valid examples of which slots
can be filled (highlighted in green) in a configuration context are shown: (a) setting
the start address for reading from a stream, (b) reading from a stream,
auto-incrementing the start address, (c,d,e) random access read where the basic block has
been split into multiple steps, (f) random access read in a pipelined kernel.
the last of these DFG edges has an output, which corresponds to the output of the instruction.
All the other DFG edges for the instruction have no inputs or outputs. This prevents invalid
connections from being made which tap into inaccessible internal partial results. However,
in order to expose the dependencies between these edges^12, some step later constraints are
defined between them, as shown by the bold arrows in the figure.
All that needs modifying in the scheduling algorithm (section 4.9 on page 87) is the check for
which cell resources are still available. The instances of the corresponding cell type are checked
for unoccupied slots matching the edge that is being scheduled. In order for a cell instance to
be considered available, it must now satisfy the following:
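The expansion of one internally pipelined instruction into several DFG edges can be sketched as below. The `DfgEdge` class, the edge naming scheme and the tuple representation of a some step later constraint are all assumptions made for illustration; they are not the data structures used by the tool flow.

```python
from dataclasses import dataclass, field


@dataclass
class DfgEdge:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def expand_internally_pipelined(instr, inputs, outputs, stages):
    """Return (edges, constraints) for an instruction with `stages` internal stages.

    Only the first edge carries the instruction inputs and only the last edge
    carries its output, so no data path can tap the internal partial results.
    Each constraint (earlier, later) stands for a 'some step later' relation.
    """
    edges = []
    for i in range(stages):
        edges.append(DfgEdge(
            name=f"{instr}.s{i + 1}",
            inputs=list(inputs) if i == 0 else [],
            outputs=list(outputs) if i == stages - 1 else [],
        ))
    constraints = [(a.name, b.name) for a, b in zip(edges, edges[1:])]
    return edges, constraints


# A 4 stage divider becomes four edges chained by three constraints.
edges, constraints = expand_internally_pipelined("DIV", ["a", "b"], ["q"], 4)
print([e.name for e in edges], constraints)
```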
•  It must match any explicit instance qualification given for the instruction.
•  It must have no occupied instruction slots for other instruction types.
•  The corresponding internal pipeline slot for the instruction type and internal pipeline 
stage must be available.
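The three conditions above can be expressed as a small availability check. The `CellInstance` class, its slot bookkeeping and the zero-based stage numbering are hypothetical and serve only to illustrate the rule.

```python
class CellInstance:
    """One instruction cell, with a pipeline stage slot list per instruction type."""

    def __init__(self, name, stages_per_instruction):
        self.name = name
        # instruction type -> one occupancy flag per internal pipeline stage slot
        self.slots = {op: [False] * n for op, n in stages_per_instruction.items()}

    def is_available(self, op, stage, required_instance=None):
        # 1. Honour any explicit instance qualification on the instruction.
        if required_instance is not None and required_instance != self.name:
            return False
        # 2. No slot belonging to a different instruction type may be occupied.
        for other, used in self.slots.items():
            if other != op and any(used):
                return False
        # 3. The matching internal pipeline stage slot must be free.
        return not self.slots[op][stage]

    def occupy(self, op, stage):
        self.slots[op][stage] = True


cell = CellInstance("MEM0", {"stream_set_addr": 1, "stream_read": 1, "random_read": 3})
cell.occupy("random_read", 0)
# A second random read stage can share the step, but a stream read cannot.
print(cell.is_available("random_read", 1), cell.is_available("stream_read", 0))
```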





Figure 5.14: Assembly data flow graph (DFG) (left) and corresponding DFG edges (right)
representing a data path containing: (a) a combinatorial divider cell, and (b) a 4-stage
internally pipelined divider cell.
To prevent DFG edges representing the internal pipeline stages for a given instruction from
being distributed across more than one cell instance, the instructions have to be modified prior
to scheduling, binding them to an explicit cell instance. Because this involves performing cell
instance allocation prior to scheduling, there is a danger that it could lead to premature
starvation of a particular cell instance, creating more steps than strictly necessary. To reduce the
chance of this, during DFG generation, a potential bindings map is generated for the instruction
cell resources. This map records the number of instructions in the basic block that could
potentially bind to each cell instance. Explicit cell instance qualifications are given to each internally
pipelined instruction in the assembly in ascending order of potential binding count.
If the basic block is to be pipelined, then the some step later constraints are replaced with same
step or later constraints, to avoid splitting the block into multiple steps. If pipelining subse-




5.7.2 Contribution: Pipelining Kernels With Internally Pipelined Cells
The only modification to the pipelining algorithm necessary to support internally pipelined
cells is the addition of a some stage earlier constraint. The scheduling algorithm currently
provides two types of constraints: same step or later and some step later. Normally, since
pipelining is performed only on basic blocks that form a single step, the some step later
constraint would never be encountered by the pipelining algorithm. The remaining constraints
simply define a data flow dependency, similar to that defined by the connectivity through the
data paths. The same step or later scheduling constraint is converted into a same stage or
earlier pipelining constraint, with the operands reversed.
The new some stage earlier constraint is applied between the DFG edges representing the
internal pipeline stages of an internally pipelined instruction. The pipelining algorithm takes
into account this constraint when assigning pipeline stages, by incrementing the stage index
where the edge could first be assigned to. Because instruction cell assignment has already
been performed, and because the only dependencies between the DFG edges representing the
internal pipeline stages are the aforementioned constraints, it is not possible for any other DFG
edge to come between these DFG edges, inserting extra pipeline stages in between them. As
a result, these DFG edges are correctly placed one stage apart, and the resulting data paths are
seen to have been delayed by the correct amount, so the output feeds directly into the data paths
of the appropriate pipeline stage. This approach also allows the pipelining algorithm to assign
additional pipeline stage registers to the output of the internally pipelined cell, if required to
feed additional pipeline stages in the dynamic (software-defined) pipeline.
During resource configuration, all the DFG edges representing the internal pipeline stages of a
given instruction map to the same instruction cell^13, resulting in the correct pipelined behaviour,
with no pipeline bubbles.
As mentioned earlier, if a step containing internally pipelined operations cannot be pipelined,
then the basic block must be re-scheduled, this time allowing step boundaries to be inserted
between the place-holders for the internal pipeline stages for each internally pipelined instruction.
Without this, insufficient iterations would be performed for the data to propagate through the
internal pipelines, resulting in data corruption.




This section shows the results from experiments that were devised to demonstrate the following:
Section 5.8.1: Improvement in throughput v.s. cost in terms of additional registers and
configuration contexts for two examples with very different minimum consecutive iteration
counts, resulting from pipelining with the multi-step (section 5.4) and single-step
(section 5.5) pipelining algorithms.
Section 5.8.2: Comparison of achievable throughputs when internally pipelined memory
access cells are used (section 5.7) v.s. non-pipelined memory access cells.
Section 5.8.3: Behaviour of the automatic target critical path constraint identification
algorithm, in terms of ability to achieve the highest real-life throughput, whilst minimising
the register count required for that throughput.
5.8.1 Results: Dynamic Pipelining
These examples were implemented in C, targeting a 65nm RICA core with sufficient resources
to implement the resulting kernels (i.e. around 250 cells and an abundance of registers). The
master clock (RRC) period is 1.0ns. On-chip SRAM memory latency is 2.0ns, and the step
load time is 20.0ns (no compression), with the capability to pre-fetch one step in advance.
Typical kernels mapped to this core would have a critical path in the range of 20-80ns. Simpler
steps may have critical paths in the range 7-20ns. The non-pipelineable data paths such as the
program counter update chain (jump chain) are 4-7ns in length.
The first example (demosaic) was chosen as a typical data path intensive application, so as to be
representative of the types of streaming applications that the target architecture was designed
for. The module consists of one large data path with interdependences, to test the ability of the
pipelining algorithm to insert stages without exceeding the target critical path constraint. This
data path has a critical path several times that of the non-pipelineable data paths (e.g. the jump
chain), which leaves room for an improvement in throughput. This situation is very common.
As well as demonstrating the sort of throughput increase to be expected from pipelining, the
example is intended to give a feel for the cost in terms of program memory size and register
count for the two pipelining methods (multi-step and single-step pipelining).
The second example (DCT) was chosen as a special case to demonstrate some limitations of
pipelining, when dealing with loops with low iteration counts. In particular, it is intended to
show how the reduction in iteration time doesn't necessarily translate to a similar reduction
in total execution time (which determines throughput), due to more iterations needing to be
performed for filling and flushing. Furthermore, it was also a test to determine whether it is
advisable to pipeline loops with low iteration counts.
5.8.1.1 Simple Demosaic
The first example is a 3-line demosaic filter [97], which involves interpolating missing colour
components from the output of a colour filter array sensor. This is a high-throughput task
normally done on-chip (integrated into the sensor) as part of a custom image signal processing
pipeline. The filter was re-implemented on a reconfigurable instruction cell array, using the C
language. Software optimisation techniques were used to reduce the filter kernel into a single
basic block, small enough to fit onto the target architecture in a single configuration context.
The filter kernel data flow graph is shown in figure C.1 on page 209, and a summary of the
operations involved is given in table 5.1. Kernels of imaging filters such as this process an














Table 5.1: Demosaic filter kernel resource requirements, in terms of instruction cells on the target 
architecture. *This register count does not include pipeline stage registers (since this is 
before pipelining has been applied).
The throughput of the resulting filter is given in the 2nd column of table 5.2. The other columns
show the effect of pipelining the kernel using the multi-step pipelining method described in this
chapter, for several target critical path lengths (timing constraints). Table 5.3 shows the effect
of pipelining the same kernel using the single-step pipelining method, for the same timing
constraints. The pipelined kernel can be seen in figure C.2 on page 210.
The results are based on the filter operating on an image size of 644 x 477 pixels. The kernel
operates without interruption on an entire line, so in the non-pipelined case, it persists for
644 iterations. Pipelining incurs an increase in the number of iterations to be performed,
i.e. n - 1 filling iterations and n - 1 flushing iterations, for an n stage pipeline. Multi-step
pipelining performs these extra iterations using new configuration contexts that are executed in
sequence, whereas single-step pipelining executes the kernel context for these extra iterations.
The throughput is calculated based on the total execution time of the kernel per line, including
these extra iterations of the kernel step itself, or the loading and executing of each epilogue or
prologue step. This gives the real-life throughput performance, taking overheads into account.
The throughput is graphed against target critical path constraint in figure 5.15.
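A first-order model of the single-step case can help in reading the tables that follow. The sketch below is an approximation under stated assumptions: it takes the filling and flushing iterations to overlap with useful work, so that a line of N pixels through an n stage pipeline needs N + n - 1 kernel iterations in total, it rounds each iteration up to a whole master clock period, and it ignores step loading and any extra restore context. Its figures are therefore indicative only, and slightly optimistic.

```python
import math


def line_time_ns(pixels, stages, critical_path_ns, clock_ns=1.0):
    """Approximate time to process one line with single-step pipelining."""
    # Each iteration lasts a whole number of master clock periods.
    iteration_ns = math.ceil(critical_path_ns / clock_ns - 1e-9) * clock_ns
    # Assumed accounting: N + n - 1 iterations for N pixels and n stages.
    iterations = pixels + (stages - 1)
    return iterations * iteration_ns


def throughput_mpixels_per_s(pixels, stages, critical_path_ns, clock_ns=1.0):
    """Pixels per nanosecond, scaled to MPixels/s."""
    return pixels / line_time_ns(pixels, stages, critical_path_ns, clock_ns) * 1e3


# Non-pipelined kernel: 644 iterations of a 53.8 ns critical path on a 1 ns clock.
print(line_time_ns(644, 1, 53.8), round(throughput_mpixels_per_s(644, 1, 53.8), 1))
```

The model makes the trade-off explicit: adding stages shortens each iteration but lengthens the line by one iteration per extra stage, so the benefit flattens once the clock quantisation stops the iteration time from shrinking.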
Target critical path (ns)   None  35.0  20.0  10.0   8.0   6.0   5.0   4.0
Actual critical path (ns)   53.8  34.9  19.9  9.75  7.83  5.84  4.92  4.25
Line execution time (µs)    34.8  22.6  12.9  6.64  5.46  4.36  4.15  4.22
Pipeline stages                1     2     3     7    10    15    27    29
Additional registers           0    24    43   143   207   319   603   644
Additional contexts            0     2     4    12    18    28    52    56
Throughput (MPixels/s)      18.5  28.5  49.8  97.1   118   148   155   153
Speed-up                       -   54%  169%  424%  538%  699%  739%  725%
Table 5.2: Throughput performance of the demosaic filter kernel before pipelining, and after
multi-step pipelining, for a range of different target critical path length constraints.
Additional register and program memory (contexts) resource requirements are shown.
Inserting more pipeline stages results in a reduction in kernel critical path, leading to a
maximum speed up of over 7x. However it also results in an increase in the number of
steps. When the quantisation of the 1GHz master clock is taken into account, the last
two columns result in the same iteration rate, but the latter has more steps, so the total
execution time is longer, leading to a lower throughput.
Target critical path (ns)   None  35.0  20.0  10.0   8.0   6.0   5.0   4.0
Actual critical path (ns)   53.8  34.9  19.9  9.75  7.83  5.84  4.92  4.25
Line execution time (µs)    34.8  22.6  12.9  6.53  5.25  3.98  3.38  3.39
Pipeline stages                1     2     3     7    10    15    27    29
Additional registers           0    26    47   156   233   361   674   721
Additional contexts            0     1     1     1     1     1     1     1
Throughput (MPixels/s)      18.5  28.5  49.8  98.7   123   162   190   190
Speed-up                       -   54%  169%  433%  562%  774%  929%  926%
Table 5.3: Throughput performance of the demosaic 3x3 filter kernel before pipelining, and after
single-step pipelining. Compare with table 5.2. A single additional configuration context
is produced in each pipelined case, to restore the final values of live output registers. A
maximum speed up of nearly 10x is achieved here, compared to 7x in table 5.2, despite
the same number of pipeline stages and critical path. This is because the single-step
pipelining avoids the overhead of loading additional steps for filling and flushing.
Both methods of structural-level pipelining can be seen to significantly increase the throughput,
at the expense of extra registers (figure 5.16). The multi-step pipelining method incurs an
additional overhead in terms of the program memory required for each new configuration context
(prologue and epilogue). The last column in tables 5.2 and 5.3 shows that a natural throughput
limit is reached, determined by the longest non-pipelineable entity in the data flow graph. In
this case it is a multiplier being fed by pipeline registers, and writing directly to a pipeline
register (see the zoomed-in portion of figure C.2). The critical path is the combinatorial delay of
reading from a pipeline stage register (0.2ns), interconnect (1.44ns), internal delay of a
multiplier (1.07ns), interconnect (1.44ns), and writing to a register (0.1ns), giving a total of 4.25ns.
This constitutes a speed-up of nearly an order of magnitude.
The resulting throughput can be seen to peak at some critical path slightly longer than the
minimum possible (5ns v.s. 4.25ns). This is because the closest integer multiple of the 1ns
master clock is 5ns, so both these geometries have the same iteration rate. However, because
the latter has more pipeline stages, more iterations have to be performed per line.
Figure 5.15: Measured throughput of the pipelined demosaic 3x3 kernel, for a range of target 
critical path length constraints. Both pipelining variants (multi-step and single-step) 
are shown. Throughput hits a wall once the non-pipelineable data paths dominate the 
critical path (in this case the jump chain, which is 4ns). Multi-step pipelining achieves 
a lower throughput for a given pipeline depth, due to the additional step loading times 
incurred at the beginning and end of each line.
Figure 5.16: Pipeline stages and pipeline stage registers for each pipeline geometry resulting from 
a range of target critical path length constraints for the demosaic 3x3 kernel. Both 
pipelining variants (multi-step and single-step) are shown. Both variants construct 
the same pipeline geometry for each target critical path, but single-step pipelining 
requires more registers in order to feed initial values from the first pipeline stage and 
final values into the last pipeline stage (see section 5.5.2).
The maximum throughput achieved with multi-step pipelining is less than that achieved with
single-step pipelining, as the increase in the number of pipeline stages causes the multi-step
pipelining method to incur additional execution time overheads in terms of the time taken to
load each additional prologue or epilogue step. As the number of pipeline stages increases
(shown in figure 5.16), the total step load times become an increasingly significant fraction of
the total line execution time. Furthermore, as the iteration rate increases, the step pre-fetch
mechanism becomes less able to hide the step loading time: pre-fetching of sequentially
executed steps (like the prologue and epilogue) allows the next configuration to be loaded into
shadow registers whilst the current context executes. Each of the prologue and epilogue steps
is executed only once during each line; therefore pre-fetching is only able to hide the portion
of the loading time that overlaps with the execution time (critical path) of the current step. The
step loading time is a constant 20ns for the purposes of this example, so one would expect
the throughput resulting from multi-step pipelining to start to lag behind that from single-step
pipelining once the critical path drops below 20ns. The graph (figure 5.15) indeed shows this.
For very shallow pipelines, multi-step pipelining can have a slight advantage in throughput.
This is because with shallow pipelines, the pipeline stages are often not very balanced, and
as a result, the prologue and epilogue steps that contain only the first few pipeline stages can
have a shorter critical path than the kernel itself. So long as these critical paths are greater than
the step load time, this causes a reduction in total execution time compared with the
single-step approach, where the kernel has the same critical path irrespective of which pipeline stages
are currently active. The reduction in execution time achieved with multi-step pipelining is
amortised over the execution time of the entire line, so the advantage is negligible unless the
iteration count is quite low.
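The part of the step load time that pre-fetching cannot hide can be estimated with a simple sum. This is a sketch under simplifying assumptions: each prologue or epilogue step runs once, the load of the next step overlaps only with the execution of the current one, and the function name and list-based interface are illustrative.

```python
def exposed_load_time_ns(step_exec_times_ns, step_load_ns=20.0):
    """Sum the step load time left exposed across sequentially executed steps.

    step_exec_times_ns: execution time (critical path) of each step during
                        which the following step is being pre-fetched.
    step_load_ns:       time to load one configuration context (assumed constant).
    """
    return sum(max(0.0, step_load_ns - t) for t in step_exec_times_ns)


# With 35 ns steps the 20 ns load is fully hidden; with 5 ns steps 15 ns is exposed each time.
print(exposed_load_time_ns([35.0, 35.0]), exposed_load_time_ns([5.0, 5.0, 5.0]))
```

The exposed time is zero while each step runs for at least the load time and grows linearly below that point, which is consistent with the 20ns threshold discussed above.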
Figure 5.17: Improvement in throughput v.s. pipeline depth for the demosaic 3x3 kernel. The 
gains slowly decrease as more pipeline stages are added. This is the case for both 
pipelining variants (multi-step and single-step). This is caused by the additional 
lengths of interconnect needed to feed values to and from each set of pipeline stage 
registers, which extends the critical path.
Figure 5.17 shows that the speed-up is not linear with the number of pipeline stages, i.e. there
are diminishing returns as the pipeline depth increases. This is mostly due to the idle time
(see figure 5.11 on page 158) becoming an increasingly significant fraction of the per-iteration
execution time, as the critical path decreases. Also, the internal delay of the pipeline stage
registers^14 contributes to the total effective critical path, gradually requiring more pipeline stages
to compensate.
The shapes of the pipeline stage and pipeline register graphs (figure 5.16) are very similar,
which indicates that the register count is roughly proportional to the number of pipeline stages.
Pipeline geometries resulting from single-step pipelining generally require more registers than
from multi-step pipelining for the same target critical path, since more operations are
constrained to be in the first or last pipeline stages, requiring more registers to bring those values
to/from the pipeline stage where they are consumed/created. The more pipeline stages there
are in total, the larger the gap in pipeline stages between the first pipeline stage and the value
consumer, or the value creator and the last pipeline stage. Therefore the gap in register counts
increases more than linearly between the two approaches.
With the dramatic increase in register counts evident in figure 5.16, the highest speed-ups can
only be achieved with large cores^15. However, as the size of the core increases, so too does
the configuration size. The number of additional configuration contexts incurred by multi-step
pipelining makes this very costly (easily an order of magnitude increase in program size). This
is the situation where single-step pipelining comes into its own.
5.8.1.2 DCT
Reconfigurable architectures are capable o f executing program s with com plex control flow. 
This adds flexibility, allowing the same core to perform  different tasks at d ifferent tim es. It also 
allows the use o f algorithm s too large to be m apped into a single context. This section shows an 
exam ple of this com m on usage pattern, in a discreet cosine transform  filter (8x8 DCT-II) [80] 
com m on in JPEG /M PEG  im age com pression.
The entire 8x8 DCT is too large to implement in a single configuration context. Instead, this
implementation of the filter performs an 8 element 1-D DCT for each row of input data, then
performs another 8 element 1-D DCT for each column. The 1-D DCT is implemented as a
kernel, which is shown in figure C.3 on page 211 (the operations are summarised in table 5.4).
The kernel performs only 8 iterations for each of the two passes. This makes it a bad candidate
for pipelining. Even so, the results in table 5.6 show that it is still possible to increase the overall
performance of the filter using pipelining: a speed-up of 35% is demonstrated with single-step
pipelining (the pipelined kernel can be seen in figure C.4 on page 211).
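
The row/column decomposition described above can be illustrated with a minimal sketch. It is not the kernel of figure C.3: it uses a naive, unnormalised DCT-II written for clarity, and the function names `dct_1d` and `dct_2d` are illustrative assumptions only.

```python
import math

N = 8  # block size used in the text (8x8 DCT-II)

def dct_1d(samples):
    """Naive, unnormalised 8-point DCT-II of one row or column (one kernel iteration)."""
    return [
        sum(samples[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
        for k in range(N)
    ]

def dct_2d(block):
    """2-D DCT built from two passes of the 1-D kernel: rows first, then columns."""
    # Pass 1: one kernel iteration per row (8 iterations in total).
    rows_done = [dct_1d(row) for row in block]
    # Pass 2: one kernel iteration per column (8 iterations in total).
    cols_done = [dct_1d([rows_done[r][c] for r in range(N)]) for c in range(N)]
    # cols_done[c][k] holds coefficient (k, c); transpose back to row-major order.
    return [[cols_done[c][r] for c in range(N)] for r in range(N)]

flat_block = [[1.0] * N for _ in range(N)]
coefficients = dct_2d(flat_block)
```

For a constant input block only the DC coefficient is non-zero, which gives a quick sanity check of the two-pass structure. Each pass runs the kernel just 8 times, which is the low iteration count the text refers to.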
14 and any additional interconnect leading to and from them.
15 which have enough registers available.
170
Pipelining








read memory     8
write memory    8
register*      35
Table 5.4: DCT kernel resource requirements, in terms of instruction cells on the target
architecture. *This register count does not include pipeline stage registers (since this is before
pipelining has been applied).
Target (ns)                 None  16.0  14.0  12.0  10.0   8.0   6.0
Actual critical path (ns)   16.8  15.0  12.5  10.5  8.51  7.99  6.71
Total execution time (ns)    483   487   441   415   437   483   467
Pipeline stages                1     2     2     2     3     4     4
Additional registers           0     5    13     9    18    31    25
Additional contexts            0     2     2     2     4     6     6
Throughput (MSamples/s)      133   131   145   154   147   133   137
Speed-up                       -   -1%   10%   16%   11%    0%    3%
Table 5.5: Performance of a 2-D 8x8 DCT-II filter, for various multi-step pipeline geometries. The
kernel only performs 8 iterations in each pass (rows or columns). Kernel critical path
and total execution time are both shown, since in this case, a faster kernel doesn’t
necessarily lead to faster execution overall. This is because the step loading time for the
additional fill and flush steps can become an appreciable fraction of the total execution
time of the original kernel.
Target (ns)                 None  16.0  14.0  12.0  10.0   8.0   6.0
Actual critical path (ns)   16.8  15.0  12.5  10.5  8.51  7.99  6.71
Total execution time (ns)    483   483   429   395   381   399   359
Pipeline stages                1     2     2     2     3     4     4
Additional registers           0    16    20    16    36    56    50
Additional contexts            0     0     0     0     0     0     0
Throughput (MSamples/s)      133   133   149   162   168   160   178
Speed-up                       -    0%   13%   22%   27%   21%   35%
Table 5.6: Performance of a 2-D 8x8 DCT-II filter, for various single-step pipeline geometries.
Compare with table 5.5. Single-step pipelining in this case imposes no additional config­




Figure 5.18: Measured throughput of the pipelined DCT kernel, for a range of target critical path 
length constraints. Both pipelining variants (multi-step and single-step) are shown. 
Only a minor improvement in throughput is seen, because the increase in iteration 
count imposed by pipelining is a substantial fraction of the total iteration count. Fluc­
tuation is due to the relative effect of idle time (see section 5.6) and the increase in 
iteration count for every pipeline stage introduced. The additional step loading time 
incurred by multi-step pipelining can be seen here to nullify the effect of the reduction 
in critical path.
[Plot. Legend: multi-step pipelining; single-step pipelining. X axis: target critical path (ns).]
Figure 5.19: Pipeline stages and pipeline stage registers for each pipeline geometry resulting from 
a range of target critical path length constraints for the DCT kernel. Both pipelining 
variants (multi-step and single-step) are shown. The non-pipelineable data paths in
this example have a critical path of 6.5ns. Additional registers are needed for single- 
step pipelining, despite the same pipeline geometry, in order to feed initial values from 
the first pipeline stage and final values into the last pipeline stage (see section 5.5.2). 
Register count isn't directly related to the number of pipeline stages; the shape of 
the data path determines this, as it depends on how many data paths are split at the 
particular point along the critical path where each pipeline stage is added.
The kernel has a short natural critical path: 16.8ns. This is less than the step load time.
However, the kernel in this example relies on the shared random access data memory interface
[61] (section 3.5), which incurs additional dynamic delays with each memory access. The
kernel uses 16-bit samples, reading in 8 per iteration, and writing 8 per iteration. The architecture
used for this example has 8 independent single-port 8-bit memory banks, allowing 4 samples to
be read or 4 samples to be written in parallel. The memory accesses are serialised at run-time
by the memory arbiter hardware. The data must be correctly aligned in order for this data
parallelism to be achievable. This number of banks is quite high, imposing an area overhead for
the arbiter logic. However, without this parallelism, this example becomes memory bandwidth
constrained, in which case pipelining the data path wouldn’t improve the iteration rate of the
kernel.
Assuming this degree of memory parallelism is possible on the hardware, the actual
per-iteration execution time of the kernel is given by the following equation:
$$t_{exec} = t_{cp} + \frac{r_{total}}{n_{banks}} \times t_{latency} + \frac{w_{total}}{n_{banks}} \times t_{latency}$$
Where:
$t_{exec}$: total kernel step execution time (one iteration).
$t_{cp}$: critical path delay rounded up to the next RRC cycle.
$t_{latency}$: data memory latency.
$r_{total}$: number of 8-bit read operations in the kernel.
$w_{total}$: number of 8-bit write operations in the kernel.
$n_{banks}$: number of independent 8-bit memory banks.
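This means that the original (non-pipelined) kernel, with its 16.8ns critical path rounded up to
17ns, has a per-iteration execution time of:
$$t_{exec} = 17\,\mathrm{ns} + \frac{16}{8} \times 2\,\mathrm{ns} + \frac{16}{8} \times 2\,\mathrm{ns} = 25\,\mathrm{ns}$$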
This is slightly larger than the step load time. In this example, the low iteration count means
that the execution time of additional filling and flushing iterations can be significant compared
to the total execution time of the kernel.
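
As a worked check of the per-iteration execution time equation above, the following minimal sketch evaluates it for the DCT kernel. The function name and parameter names are illustrative assumptions. The 1.0ns RRC period is also an assumption here, chosen because it reproduces the rounding of 16.8ns to 17ns used in the text.

```python
import math

def kernel_iteration_time_ns(critical_path_ns, reads_8bit, writes_8bit,
                             banks=8, latency_ns=2.0, rrc_period_ns=1.0):
    """Per-iteration execution time: rounded critical path plus memory delays."""
    # t_cp: critical path rounded up to the next RRC cycle (period is an assumption).
    t_cp = math.ceil(critical_path_ns / rrc_period_ns) * rrc_period_ns
    # Dynamic delays: 8-bit accesses spread over the banks, each batch costing one latency.
    read_delay = reads_8bit / banks * latency_ns
    write_delay = writes_8bit / banks * latency_ns
    return t_cp + read_delay + write_delay

# 8 samples of 16 bits each are read and written per iteration: 16 eight-bit accesses each way.
original_ns = kernel_iteration_time_ns(16.8, reads_8bit=16, writes_8bit=16)
pipelined_ns = kernel_iteration_time_ns(10.5, reads_8bit=16, writes_8bit=16)
```

With these inputs the sketch gives 25ns for the non-pipelined kernel and 19ns for a 10.5ns data path critical path. The memory term (8ns) is unaffected by pipelining the data path, so it becomes a larger share of each iteration as the critical path shrinks.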
For multi-step pipelining, this leads to the best throughput being achieved when the overhead
of loading and executing the additional steps is less than the reduction in per-iteration execution
time of the kernel achieved through pipelining. Table 5.5 (and figure 5.18) shows this occurring
with 2 pipeline stages and a data path critical path of 10.5ns, giving a 16% improvement in
throughput. Taking into account the dynamic delays, the actual kernel per-iteration execution
time is:
$$t_{exec} = 11\,\mathrm{ns} + \frac{16}{8} \times 2\,\mathrm{ns} + \frac{16}{8} \times 2\,\mathrm{ns} = 19\,\mathrm{ns}$$
Pipelining deeper than this decreases the per-iteration execution time to below the step loading
time, which offsets the advantage.16
For single-step pipelining, this leads to the best throughput being achieved when the overhead
of the additional iterations of the kernel is less than the reduction in per-iteration execution
time of the kernel achieved through pipelining. Table 5.6 (and figure 5.18) shows this occurring
with 4 pipeline stages and a data path critical path of 6.71ns, giving a 35% improvement in
throughput. This is the deepest pipeline possible, with the critical path being determined by a
non-pipelineable feedback loop involving a counter.
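
The throughput and speed-up rows of tables 5.5 and 5.6 can be reproduced from the total execution times. The sketch below assumes that throughput is simply the 64 samples of one 8x8 block divided by the total execution time. This relation is inferred from the tabulated values rather than stated in the text, and the function names are illustrative.

```python
SAMPLES_PER_BLOCK = 8 * 8  # one 8x8 block (assumed basis for the throughput figures)

def throughput_msamples_per_s(total_time_ns):
    """Samples per block divided by total execution time, in MSamples/s."""
    return SAMPLES_PER_BLOCK / total_time_ns * 1000.0

def speed_up_percent(baseline_time_ns, pipelined_time_ns):
    """Relative improvement in throughput over the non-pipelined filter."""
    return (baseline_time_ns / pipelined_time_ns - 1.0) * 100.0

baseline_ns = 483           # non-pipelined total execution time (tables 5.5 and 5.6)
best_multi_step_ns = 415    # table 5.5, 12.0ns target
best_single_step_ns = 359   # table 5.6, 6.0ns target
```

Rounded to the nearest integer, these inputs give 133 MSamples/s for the baseline, and speed-ups of 16% and 35% for the best multi-step and single-step geometries respectively, matching the tables.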
When single-step pipelining, reading from memory must occur in the first pipeline stage, and
writing to memory must occur in the last pipeline stage. This means that more registers are
needed for a given pipeline depth.17 It also means that each iteration of the kernel during
pipeline filling will have only memory reads, and each iteration of the kernel during pipeline
flushing will have only memory writes. Thus the dynamic delay during filling or flushing is half
that during normal kernel loop iterations. This slightly offsets the effect of the increase in the
number of iterations, which is why the point of highest throughput occurs at a higher pipeline
depth than with multi-step pipelining. It is also the reason for the higher maximum achievable
throughput with single-step pipelining.
5.8.2 Results: Internally Pipelined Cells
As in the previous section, the example in this section was implemented in C, targeting a 65nm
RICA core with sufficient resources to implement the resulting kernel (i.e. around 250 cells
and an abundance of registers). The master clock (RRC) period is 1.0ns. On-chip SRAM
memory latency is 2.0ns, and the step load time is 20.0ns (no compression), with the capability
to pre-fetch one step in advance. A typical kernel will have a critical path in the range 20-80ns.
Non-pipelineable data paths such as the jump chain are around 4-7ns.
The intention of this experiment was to demonstrate the effectiveness of providing internally
pipelined cells. With internally pipelined cells, operations that would otherwise contribute
to the non-pipelineable data paths that determine the minimum achievable pipelined critical
path can be amortised over multiple iterations, removing them from the critical path. The
most common case of this is with memory access, so a memory bandwidth intensive streaming
application was chosen.
16 since the loading cost can only be amortised over the very small number of kernel iterations.
















Table 5.7: Gamma correction filter kernel resource requirements, in terms of instruction cells
on the target architecture. *Mutually exclusive, for the two different implementations.
**This register count does not include pipeline stage registers (since this is before
pipelining has been applied).
Target critical path (ns)   None  30.0  20.0  15.0  10.0   8.0   6.0   5.0
Actual critical path (ns)   44.8  27.9  19.3  14.9  9.82  7.81  5.82  5.37
Line execution time (µs)    30.7  19.9  14.1  10.9  8.40  7.14  5.88  5.29
Pipeline stages                1     2     3     4     6     9    13    20
Additional registers           0    30    43    65   106   155   237   338
Additional contexts            0     0     0     0     0     0     0     0
Throughput (MPixels/s)      41.6  64.4  90.5   117   152   179   218   242
Speed-up                       -   55%  117%  181%  266%  330%  422%  481%
Table 5.8: Performance of the gamma correction filter kernel before pipelining, and after
pipelining, using combinatorial memory operations (RMEM). Results are given only for the
(unrealistic) hardware where all reads are performed in parallel. This gives a fairer
comparison with table 5.9. The number of pipeline stages, critical path, and throughput
increase in the usual manner as the target critical path constraint is tightened, achieving
a maximum speed-up of nearly 5x.
Target critical path (ns)   None   30.0   20.0   15.0   10.0    8.0    6.0    5.0
Actual critical path (ns)   45.4   28.0   19.3   14.9   9.82   7.81   5.82   4.92
Line execution time (µs)    71.0   18.0   12.9   9.87   6.60   5.30   4.00   3.38
Pipeline stages             1/4*      5      6     18     19     22     26     33
Additional registers           0     84    108    329    342    406    471    590
Additional contexts           3*      0      0      0      0      0      0      0
Throughput (MPixels/s)      18.0   70.9   99.1    130    194    241    320    379
Speed-up                       -   294%   450%   620%   977%  1239%  1673%  2004%
Table 5.9: Performance of the gamma correction filter kernel before pipelining, and after
pipelining, using internally pipelined memory operations (SRBUF_RAM). Compare with
table 5.8. *Without pipelining, the kernel has to be split into multiple steps to compensate.
As a result, the non-pipelined throughput is very low, and thus distorts the speed-up
values. Comparing the line execution times instead, this version achieves a pipelined
throughput 35% higher than in table 5.8. This is due to the memory latency being
hidden by overlapping it across multiple iterations, compared to the RMEM example where
the latency has to appear in a single iteration, extending its critical path.
The example used in this section is a gamma correction module [81] from a typical image signal
processing pipe. The complexity is shown in table 5.7. Gamma correction is applied in this case
to two RGB pixel streams at once. Each sample is 16 bits. There are two 16-bit table look-ups
required per sample: one look-up from each of the two 64 entry tables (base and gradient).
This means that 12 16-bit memory operations are required per iteration of the kernel (2 pixel
streams x 3 colour samples x 2 look-ups). The
input and output streams are implemented via separate data interfaces, so do not consume data
memory bandwidth. The table look-ups impose a significant demand on memory bandwidth.
Two implementations are compared here: normal combinatorial data memory interface vs.
multiple independent stream/line buffers in random access mode (internally pipelined).
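
The count of 12 look-ups per iteration can be made concrete with a small sketch. The text only states that each sample needs one look-up from a base table and one from a gradient table. The piecewise-linear interpolation below, the split of a 16-bit sample into a 6-bit index and a 10-bit fraction, and all names are assumptions made for illustration, and may differ from the kernel of figure C.5.

```python
INDEX_BITS = 6                  # 64-entry tables, as stated in the text
FRAC_BITS = 16 - INDEX_BITS     # assumed split of a 16-bit sample
FRAC_MASK = (1 << FRAC_BITS) - 1

def correct_sample(sample, base, gradient):
    """Return (corrected value, number of table look-ups) for one 16-bit sample."""
    index = sample >> FRAC_BITS          # upper bits select the table entry
    frac = sample & FRAC_MASK            # lower bits interpolate within the segment
    value = base[index] + ((gradient[index] * frac) >> FRAC_BITS)
    return value, 2                      # one base look-up plus one gradient look-up

def kernel_iteration(pixel_a, pixel_b, base, gradient):
    """One kernel iteration: two RGB pixels, one from each stream."""
    outputs, lookups = [], 0
    for pixel in (pixel_a, pixel_b):
        corrected = []
        for channel in pixel:
            value, count = correct_sample(channel, base, gradient)
            corrected.append(value)
            lookups += count
        outputs.append(tuple(corrected))
    return outputs, lookups

# Identity tables (hypothetical test data): the corrected output equals the input.
identity_base = [i << FRAC_BITS for i in range(1 << INDEX_BITS)]
identity_gradient = [1 << FRAC_BITS] * (1 << INDEX_BITS)
pixels_out, lookups_per_iteration = kernel_iteration(
    (0, 1000, 65535), (4096, 32768, 12345), identity_base, identity_gradient)
```

Whatever interpolation the real kernel uses, the look-up count is fixed by the structure: 2 streams, 3 samples per pixel, 2 tables.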
As mentioned in the previous section, the normal data memory interface hardware on RICA
provides arbitration to serialise conflicting memory operations. This allows the memory
bandwidth to be increased slightly, whilst still maintaining a single memory address space; i.e. it
is a trade-off between ease of programming and bandwidth/area. Area grows exponentially as
more memory banks are added to increase the bandwidth. The interface is combinatorial: the
address is sampled in the current iteration, then at the next clock edge, that address is accessed
in the memory interface, and the value stored there is returned for use in the current iteration.
This effectively causes the critical path of the step to be increased by the memory latency times
the number of operations that are queued, plus some idle time to round up to the next master
clock period.
The first implementation of the kernel uses the normal data memory interface, represented by
the RMEM instruction. The kernel is shown in figure C.5 on page 212, with resources
summarised in table 5.7. The same code is run on two different variants of the hardware: a realistic
interface consisting of 8 independent 8-bit banks, allowing 4 16-bit read operations in parallel;
and an unrealistic interface consisting of 24 independent 8-bit banks, allowing all 12 16-bit
read operations to occur in parallel. In the first hardware variant, each iteration of the kernel
requires three sequential reads from each bank. In the second variant, each iteration requires
only one read from each bank. This allows for a more fair comparison with the stream buffer
implementation, described next.
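
The number of sequential read batches in each hardware variant follows directly from the bank count. The sketch below assumes that each 16-bit read occupies two 8-bit banks, which is consistent with 8 banks serving 4 reads and 24 banks serving 12. The 2.0ns latency is the SRAM figure quoted for this experiment, and the function names are illustrative.

```python
import math

def sequential_read_batches(reads_16bit, banks_8bit):
    """Number of back-to-back memory accesses needed per kernel iteration."""
    parallel_reads = banks_8bit // 2      # assumed: one 16-bit read spans two 8-bit banks
    return math.ceil(reads_16bit / parallel_reads)

def added_memory_delay_ns(reads_16bit, banks_8bit, latency_ns=2.0):
    """Latency added to the critical path by the queued combinatorial reads."""
    return sequential_read_batches(reads_16bit, banks_8bit) * latency_ns

realistic_batches = sequential_read_batches(12, banks_8bit=8)     # 4 reads in parallel
unrealistic_batches = sequential_read_batches(12, banks_8bit=24)  # all 12 in parallel
```

This gives three batches for the realistic interface and one for the unrealistic interface, so the realistic variant adds roughly three memory latencies to every iteration before rounding up to the master clock period.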
A second implementation of the gamma correction kernel uses domain-specific stream buffers
(line memories), with corresponding interface cells in the core, which are represented by the
SRBUF_RAM instruction. A typical ISP core would contain many of these, so it is possible
to achieve very high memory bandwidths for small data sets like the look-up tables in this
example. These stream buffers are designed to be accessed sequentially, where each successive
location is automatically de-referenced and returned at the beginning of each iteration of a
kernel. This hides the memory access latency by fetching the next sample whilst the current
iteration of the kernel is executing. This is only possible because the next address is known
in advance. In order to perform random access whilst still being able to hide the memory
latency, the operation has to be pipelined; i.e. the result appears at the output of the cell
several iterations later. In the hardware used here, the SRBUF_RAM instruction is pipelined into
4 stages, meaning that the result appears 3 iterations after where the address was sampled.
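
The behaviour of an internally pipelined look-up, as seen by the kernel, can be modelled as a short delay line. This is a toy model only: the class name `PipelinedLookup` and its interface are assumptions, and it captures the iteration-level latency rather than any detail of the SRBUF_RAM hardware.

```python
from collections import deque

class PipelinedLookup:
    """Toy model of a table read that is internally pipelined into `stages` stages."""

    def __init__(self, table, stages=4):
        self.table = table
        # stages - 1 results are in flight, so a result emerges 3 iterations
        # after its address was sampled when stages == 4.
        self.in_flight = deque([None] * (stages - 1))

    def step(self, address):
        """One kernel iteration: return the oldest result, then sample a new address."""
        result = self.in_flight.popleft()
        self.in_flight.append(self.table[address])
        return result

squares = [x * x for x in range(8)]      # hypothetical table contents
lookup = PipelinedLookup(squares)
outputs = [lookup.step(address) for address in [1, 2, 3, 4, 5]]
```

The first three iterations return nothing useful, and the value for address 1 appears in the fourth. A non-pipelined kernel therefore cannot consume the result in the iteration that issued the address, which is why it has to be split into several steps.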
Tables 5.8 and 5.9 and figure 5.20 show the relative performance of each of these scenarios,
over a range of pipeline target critical path constraints (image size 640x474). Only single-step
pipelining is demonstrated here. The pipelined kernel data flow graphs can be seen in figure C.6
on page 213 and figure C.8 on page 215.
Figure 5.20: Measured throughput of the non-pipelined and single-step pipelined gamma correc­
tion kernel, for a range of target critical path length constraints. Compares com­
binatorial memory read operations (RMEM) (with two different memory interface 
characteristics) and internally pipelined read operations (SRBUF_RAM). Without 
pipelining, the internally pipelined cells lead to a spuriously low throughput, due to 
the kernel having to be split into multiple steps in order to bring the output into 
sync with the input. The throughput ceiling in the RMEM cases is determined by 
the memory latency (or a multiple of this when the reads have to be staged), as the 
memory accesses must occur entirely within one iteration (and pipeline stage). Mem­
ory latency is distributed across multiple iterations in the SRBUF_RAM case, so the 
throughput ceiling there is determined by other non-pipelineable data paths such as 
the jump chain. A knee can be seen at about 16ns, where integer multiples of the 
memory latency, master clock, and non-pipelineable data paths in the step happen to 
align.
The data points at the far left of the graph show the kernels when not pipelined. Note that
the performance of the implementation using the internally pipelined stream buffer cells is
spuriously low when the kernel is not pipelined. This is because the kernel must be split into
multiple steps (4) in order to take account of the cycles of latency resulting from the internal
pipeline in the interface cells (shown in figure C.7 on page C.7). This makes the kernel program
memory bandwidth limited, since the core must be reconfigured four times per iteration of the
kernel. It also increases the total critical path18, due to the middle steps which just wait for data
to propagate through the corresponding pipeline stage of the internally pipelined cells.
As can be seen in figure 5.21, when pipelining is enabled, the implementation using SRBUF_RAM
pipelines to a minimum of 5 stages, in order to fit around the internal pipelining of the interface
cells. This roughly translates to one stage containing the logic that generates the address, then
4 stages corresponding to the internal pipeline, with the last stage also containing logic that
operates on the result.
18 which is the sum of the critical paths of each step.
[Plots. Legend: Combinatorial (RMEM), 4 x 16-bit reads in parallel; Combinatorial (RMEM), 12 x 16-bit reads in parallel; Internally pipelined (SRBUF_RAM). X axis: target critical path (ns).]
Figure 5.21: Pipeline stages and pipeline stage registers for each pipeline geometry resulting from 
a range of target critical path length constraints for the gamma correction kernel. 
Compares combinatorial read operations (RMEM) and internally pipelined read
operations (SRBUF_RAM). The kernel using SRBUF_RAM, when not pipelined, has to
be split into multiple steps. This is represented by a fractional pipeline stage count.
With SRBUF_RAM, a minimum pipeline of 5 stages is needed to work around the
internal pipeline of the memory access cell. A sudden jump in pipeline stages (and thus
also pipeline stage registers) is seen going from 20ns to 19ns, where the pipeline stages 
introduced to work around the internal pipeline of the cells are insufficient to meet 
the critical path constraint, and so the data paths suddenly also need to be pipelined. 
This is followed by the usual exponential rise, until the upper limit is reached (caused 
by other non-pipelineable data paths).
For any given (achievable) target critical path, the actual performance of the implementation
using combinatorial memory access cells (RMEM) is lower than that with the synchronous memory
access cells (SRBUF). This is due simply to the memory latency extending the critical path. As
a result, the version where reads have to be performed sequentially in batches has lower
performance than the one where they are all performed in parallel. The performance curves are a very
similar shape, where the performance gap widens as the (constant) memory latency becomes
an increasingly significant fraction of the total execution time of an iteration of the kernel.
Figure 5.22 shows how the addition of pipeline stages affects the speed-up (pipelined
throughput compared to non-pipelined throughput). Both scenarios using combinatorial memory access
cells show a logarithmic improvement. The relationship with internally pipelined cells shows a
similar logarithm, but with a steep start and a step change shortly after. Below this step change,
there is an artificially high pipeline stage count (high target critical paths), to work around the
fixed internal pipeline depth of the memory access cells. This pipeline stage count jumps up
once the kernel pipeline geometry begins to become more compatible with the internal pipeline
geometry of the cells.
Figure 5.22: Improvement in throughput vs. pipeline depth for the gamma correction kernel. The
gains decrease logarithmically as more pipeline stages are added. This is due to the
additional interconnect length introduced by inserting pipeline stage registers. The
SRBUF_RAM case shows a step change corresponding to the knee seen in figure 5.21.
The minimum possible pipelined critical path is lower with internally pipelined memory access
cells, since (as with other synchronous cells) the combinatorial delay is very short: the output
of the cell behaves just like reading from a register; inputs to the cell are simply sampled at the
end of the iteration, like when writing to a register. Therefore, the internally pipelined memory
access cells show a significant performance advantage even over the unrealistic case where
the combinatorial memory access interface has equal memory bandwidth. The cost of using
internally pipelined cells comes in the form of register counts, since more pipeline stages are
needed to keep the data paths involving combinatorial cells in sync with the data paths involving
internally pipelined cells. However, in large cores suitable for image signal processing, registers
are in abundance.
5.8.3 Results: Automatic Timing Constraint
This section demonstrates the effect of clock quantisation on the maximum achievable pipelined
throughput, by pipelining examples on hardware with different master clock periods. The
algorithm for the automatic choice of timing constraint (section 5.6) takes advantage of this effect to
determine what the minimum achievable pipelined critical path should be, and uses that as the
target. To make the effect of clock quantisation more pronounced, a slower technology process
(180nm) was used, to make the resulting idle time a higher fraction of the pipelined critical
path. Two applications were tested to demonstrate that the effect is not application dependent.
The single-step pipelining algorithm with automatic choice of timing constraint was applied to
two real-life applications: a 7-line Hamilton demosaic filter [98], and a multiplication-based
iterative software division algorithm. The demosaic involves interpolating missing colour
components from the Bayer output of a colour filter array sensor. Division on a per-pixel level is
used as part of many commercial noise reduction filters. Both are high-throughput tasks
normally done on-chip as part of a custom image signal processing (ISP) pipeline, used in modern
digital cameras and mobile phones. Both kernels were implemented on a reconfigurable
instruction cell-based processor [5] (180nm timing figures), using the C language. Software
optimisation techniques were used to reduce the main kernel in each case into a basic block
small enough to fit onto the target architecture in a single configuration context. Both example
kernels produce a single output pixel per iteration.
Master clock period (ns)             20.0   15.0   10.0    5.0    3.0    2.0    1.0
Pipeline stages                         5      7      5      7      9      9     11
Pipeline stage registers               80    123     80    123    153    153    189
Min. possible constraint (ns)       10.95  10.95  10.95  10.95  10.95  10.95  10.95
Non-pipelined critical path (ns)     77.0   77.0   77.0   77.0   77.0   77.0   77.0
Pipelined critical path (ns)         19.8  14.65   19.8  14.65  11.55  11.55  11.00
Improvement in critical path         389%   526%   389%   526%   667%   667%   700%
Non-pipelined iteration time (ns)    80.0   90.0   80.0   80.0   78.0   78.0   77.0
Pipelined iteration time (ns)        20.0   15.0   20.0   15.0   12.0   12.0   11.0
Improvement in iteration time        400%   600%   400%   533%   650%   650%   636%
Pipelined throughput (MPixels/s)     50.0   66.6   50.0   66.6   83.3   83.3   90.9
Speed-up                             400%   600%   400%   533%   650%   650%   636%
Table 5.10: Performance of the Hamilton demosaic filter kernel before and after automatic
pipelining, for a range of different master clock periods. See section 5.8 for an
explanation of the results.
The performance of the pipelining for both cases is shown in figure 5.23, and some additional
details are given for the Hamilton demosaic in table 5.10. The independent parameter in these
experiments is the master clock (RRC) period, NOT the pipeline timing constraint (which was
used in the previous experiments). This represents a physical difference in the hardware, rather
than just a compile-time setting in the tools. The experiments in the previous sections show
that simply pipelining to a smaller target pipeline timing constraint doesn’t necessarily lead to
an improvement in iteration rate, but does lead to an increase in resources. For example in
table 5.3, a 4.0ns target leads to a deeper pipeline (more stages) than a 5.0ns target, yet the
iteration rate is the same. This is due to quantisation: the error between the time taken for each
data path fragment in each pipeline stage to complete and the closest integer multiple of the
master clock period. The automatic timing constraint algorithm chooses a target which results
in the least quantisation. The experiments in this section show the effectiveness of this, with
different levels of quantisation, i.e. different master clock periods.
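
The quantisation effect can be checked against table 5.10 with a minimal sketch, assuming the iteration time is the critical path rounded up to the next integer multiple of the master clock period and that the remainder is idle time. The function names are illustrative.

```python
import math

def iteration_time_ns(critical_path_ns, clock_period_ns):
    """Critical path rounded up to the next integer multiple of the master clock."""
    return math.ceil(critical_path_ns / clock_period_ns) * clock_period_ns

def idle_time_ns(critical_path_ns, clock_period_ns):
    """Time in each iteration spent waiting for the next clock edge."""
    return iteration_time_ns(critical_path_ns, clock_period_ns) - critical_path_ns

# Non-pipelined Hamilton demosaic kernel: 77.0ns critical path (table 5.10).
non_pipelined = [iteration_time_ns(77.0, period) for period in (20.0, 15.0, 10.0, 5.0)]
# Pipelined critical paths from table 5.10 at 5.0ns and 3.0ns master clock periods.
pipelined = [iteration_time_ns(14.65, 5.0), iteration_time_ns(11.55, 3.0)]
```

These values reproduce the iteration time rows of table 5.10: 80, 90, 80 and 80ns for the non-pipelined kernel, and 15 and 12ns for the pipelined one. At a 15.0ns clock period the non-pipelined kernel wastes 13ns per iteration, the worst case in the table.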
The main trend to notice is the ability for the maximum achievable iteration rate (after
pipelining) to generally increase as the master clock frequency is increased. Since the same underlying
data path is used in each case, the non-pipelined critical path length is constant. The iteration
time of the non-pipelined data paths is just the critical path length rounded up to the next
integer multiple of the master clock period. As the master clock period is decreased, the algorithm
Figure 5.23: Throughput before and after automatic pipelining, for a range of different master 
clock periods, for two pixel-level code examples: Hamilton demosaic and iterative 
software division. The theoretical line shows what could be achieved if the master 
clock were of infinite frequency, based on the longest indivisible critical path (the iter­
ation control logic in both of these cases). The throughput of the non-pipelined cases 
are determined by the critical path of that particular kernel, whereas the through­
put of the pipelined cases are determined by the minimum integer multiple of the 
master clock able to cover the non-pipelineable data paths. The throughput of the 
non-pipelined cases vary only slightly, since the change in idle time caused by dif­
ferent master clock frequencies represents only a small fraction of the kernel critical 
path. The pipelined cases however show a pronounced exponential trend related to 
the idle time as a fraction of the master clock period, which wraps around when­
ever the pipelined critical path coincides with an integer multiple of the master clock 
period.
is able to produce a pipeline with a critical path closer to the theoretical minimum19.
However, the number of pipeline stages required to do this increases in a faster than linear fashion.
This is another effect of quantisation: as the pipeline stages get shorter, the relative size of the
indivisible units being pipelined20 increases compared to the resolution of the master clock.
The algorithm does well in minimising this effect, and the percentage improvements with and
without the effect of the master clock are relatively close in all cases.
The pipeline geometries constructed for each master clock frequency setting are shown in
figure 5.24. Both examples show identical post-pipelining throughput (iteration rate), as both
cases have the same longest indivisible critical path, corresponding to the iteration control
(jump) logic (shown by the theoretical line in figure 5.23). There are no data dependencies or
other constraints limiting the potential for pipelining in either example. If data dependencies,
feedback loops, or other constraints were present, these would be reflected by a larger
indivisible critical path. The shorter the indivisible critical path, the more important the behaviour of
the automatic pipelining algorithm.
19 as dictated by the indivisible data paths such as feedback loops, and the jump condition chain.
20 i.e. the internal delays of each cell and section of interconnect.
Figure 5.24: Pipeline geometry from automatic pipelining, for a range of different master clock 
periods, for two pixel-level code examples: Hamilton demosaic and iterative software 
division. The two examples have a different non-pipelined critical path, so the num­
ber of pipeline stages differ, despite the pipelined throughput being the same (see 
figure 5.23). The automatic timing constraint algorithm can be seen to reduce the 
number of pipeline stages accordingly, when the idle time prevents a higher through­
put being achieved. This has a significant effect on reducing register consumption. 
As before, there is some randomness in the number of registers required for a given 
pipeline depth, as this depends on where along the critical path the registers have 
been inserted (i.e. how many data paths span across pipeline stages).
The resource-saving effect of the algorithm can be seen to come into effect each time the
current integer multiple of the master clock frequency drops below the indivisible critical path
length. This makes the iteration rate curve appear to wrap around each time it tries to cross
the theoretical maximum iteration rate line. One such boundary is identified in figure 5.25. By
extending the length of the pipeline stages up to the next master clock period, the number of
registers is minimised, which avoids needless congestion on the interconnect. The reduction in
the number of pipeline stages reduces the configuration size and the latency, since fewer filling
and flushing iterations need to be performed.
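
The wrap-around behaviour can be reproduced with a small sketch that relaxes the target to the next master clock multiple at or above the minimum possible constraint. This is a simplification that happens to agree with table 5.10, not the algorithm of section 5.6 itself, and the function name is an assumption.

```python
import math

def relaxed_target_ns(min_constraint_ns, clock_period_ns):
    """Smallest integer multiple of the master clock that covers the indivisible path."""
    multiples = math.ceil(min_constraint_ns / clock_period_ns)
    return multiples * clock_period_ns

MIN_CONSTRAINT_NS = 10.95   # minimum possible constraint from table 5.10
clock_periods_ns = [20.0, 15.0, 10.0, 5.0, 3.0, 2.0, 1.0]
targets_ns = [relaxed_target_ns(MIN_CONSTRAINT_NS, period) for period in clock_periods_ns]
```

The resulting targets of 20, 15, 20, 15, 12, 12 and 11ns match the pipelined iteration times in table 5.10. The 10.0ns clock is the clearest case of the wrap-around: one period cannot cover 10.95ns, so the target relaxes to 20ns and the pipeline needs only 5 stages, the same as with the 20.0ns clock.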
Figure 5.25: The graphs of figures 5.23 and 5.24 to show the correlation between them. When 
the clock frequency is such that maximum theoretical throughput can be obtained, 
the appropriate number of pipeline stages are created, as shown by the line on the 
left. The line on the right shows that when the clock frequency is such that the max­
imum achievable throughput is significantly less than the theoretical maximum, the 
algorithm relaxes the number of pipeline stages— and thus saves resources—without 




This chapter described algorithms used to drastically improve the performance of compute-intensive
loops. The data paths of the configuration context corresponding to a loop body can
be split into pipeline stages. This decreases the critical path of the loop, increasing the iteration
rate, and thus the throughput, at the cost of additional registers. For large cores, pipelining can
lead to near ASIC levels of performance.
The first method (multi-step pipelining) is a software-only solution that uses dynamic
reconfiguration to perform pipeline filling and flushing (by adding configuration contexts with some
pipeline stages omitted).
A second method (single-step pipelining) was proposed which removes the need for additional
configuration contexts, at the cost of some flexibility and minor modifications to the hardware.
This is particularly advantageous on large cores, where the number of pipeline stages and the
memory required to store a configuration context are both large.
Multi-step pipelining works well with small cores, where register counts are low, and the cost
of additional configuration contexts is low. Single-step pipelining is most suited to large cores,
where registers are in abundance, and where very deep pipelines can achieve significant gains
in throughput. Single-step pipelining is also of use in loops with low iteration counts.
An algorithm for completely automating the task of pipelining was demonstrated, which
automatically chooses a suitable pipeline target critical path constraint for the pipeline stage
assignment algorithm to operate on, such that the maximum possible throughput is achieved with
minimal cost in registers.
Further improvements were demonstrated by the addition of support for internally pipelined
cells. This allows high-latency operations such as memory access to be amortised over multiple
iterations, reducing the critical path, and thus further increasing the throughput. Dynamic
pipelines are constructed around the internal pipeline depth of each cell. The cost of this is
generally a small increase in registers, as additional pipeline stages are often needed to bring
other values into sync with the results of internally pipelined operations, when the data paths
in a configuration context happen to not line up well with the internal pipeline depth of the
cells. Internally pipelined cells are also inefficient when used without pipelining, as the internal
pipeline depth has to be compensated for by the addition of configuration contexts instead of
pipeline stages.
The next chapter concludes the thesis by restating what was done, what was shown, its signifi­




The overall objective of the work presented in this thesis was to develop tools to allow
dynamically reconfigurable computing architectures to be programmed and worked with in a manner
similar to how microprocessors are. The tools must map high-level ANSI C code to configuration
contexts as efficiently as possible, with as little modification to the source code as possible.
Furthermore, the design cycle must be fast, to allow a high design iteration rate. To make
this possible, a high-speed simulator is needed. The work was intended to be as generic as
possible, to apply to as wide a range of architectures as possible, i.e. the tool chain must be
re-targetable. However, it was necessary to bound this in some way: the work targets a family
of dynamically reconfigurable arrays with the following properties:
•  The reconfigurable array is in control of its own reconfiguration.
•  Coarse-grained, where each functional unit supports operations similar to those in a typical
RISC instruction set.
These two properties make the array sufficiently comparable to a microprocessor that a
conventional C compiler can target it: being in control of its own reconfiguration allows the
array to perform control flow; functional units supporting RISC-like instructions allow matching
expressions to be written in a compiler back-end. Furthermore, coarse-grained architectures
have a small configuration size, which means that they can be reconfigured much more rapidly
than finer grained data path machines (such as FPGAs).
These properties mean that the work presented here is best suited to a single family of
dynamically reconfigurable array: the reconfigurable instruction cell array (RICA [5]). The concepts
could be easily applied to a wider range of architectures. However, this family of architectures
represents a significant design space: the number of cells, the way they are interconnected, and
the specific functionality of each cell type, is entirely open to exploration using these tools.
The sections of this chapter correspond to each of the main chapters in this thesis: emulation,
scheduling, and pipelining. They begin by restating the problem description, aims and
objectives that were made in the corresponding chapter, then show how the theories and results
presented in those chapters satisfy these goals. The chapter concludes with an overall




6.1.1 Emulation: Problem Description
In order to validate application code and explore the design space, a high-speed simulation of
the target architecture must be available.
Existing software-based methods of simulation for reconfigurable computing architectures are
event-driven, and incur a sizeable time penalty for every configuration context. For instruction
cell architectures, which have to be reconfigured many millions of times per second, this
overhead eclipses the actual work done by the modelled processing units.
Software-based emulators are high-speed simulations available for more conventional
microprocessor architectures. However, these do not provide support for operation chaining, which
makes them unsuitable for data path machines, such as those considered by this thesis.
6.1.1.1 Aims
•  Provide a software simulator to allow the design search space to be explored within a
reasonable time frame.
•  Allow rapid application development and validation.
6.1.1.2 Objectives
Generic: It must be easy to describe the target architecture (e.g. resource counts, timing
figures), and the simulation must adapt accordingly.
Extensible: It must be easy to add new functionality (e.g. cell types), preferably using a
high-level description.
Fast: The simulation should be as close to real-time as possible.
Accurate: The simulation should behave as much as possible like the target architecture at a
given level of abstraction, and should give a reasonable estimate of the timing.
6.1.2 Emulation: Demonstrated Outcomes and Contribution to Knowledge
A novel approach was suggested to reduce the per-step execution overhead seen in other
simulators for data path architectures. This involves moving the resolution of the dependencies in
the data paths into a pre-processing stage, prior to execution. When applied to an example
processor, the results (section 3.4 on page 45) show that the execution speed achieved using this
new approach is around two orders of magnitude higher than an equivalent SystemC model,
and largely matches the speed of an FPGA model of the target reconfigurable instruction cell
array. For the examples shown, this corresponds to a few percent of real-time performance.
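
The idea of resolving data path dependencies once, ahead of execution, can be sketched as follows. This is only an illustration under stated assumptions: a configuration context is modelled as a dictionary of hypothetical cell names, each with a Python function and a list of input names, and a standard topological sort (`graphlib`, Python 3.9 or later) stands in for the serialisation algorithm of the thesis, which is not reproduced here.

```python
from graphlib import TopologicalSorter


def serialise(context):
    """Pre-processing: turn a configuration context into a static call list.

    `context` maps a cell name to (function, [input names]). Inputs that are
    not cells themselves (e.g. registers) are assumed to hold values already.
    """
    deps = {name: set(inputs) for name, (_fn, inputs) in context.items()}
    order = TopologicalSorter(deps).static_order()
    return [(name, *context[name]) for name in order if name in context]


def execute(call_list, values):
    """Execution: walk the call list with no dependency checks at run time."""
    for name, fn, inputs in call_list:
        values[name] = fn(*(values[i] for i in inputs))
    return values


# Hypothetical three-cell chain: sub depends on shl, which depends on add.
context = {
    "sub": (lambda a, b: a - b, ["shl", "r1"]),
    "shl": (lambda x: x << 1, ["add"]),
    "add": (lambda a, b: a + b, ["r1", "r2"]),
}
call_list = serialise(context)
result = execute(call_list, {"r1": 3, "r2": 4})
```

The call list is built once per configuration context and can then be replayed for every iteration of that context, which is where the saving over an event-driven simulator would come from in this simplified model.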
This level of performance makes the proposed emulator suitable for use in feedback-directed
optimisation, and thus could be an important part of future toolchains. This therefore satisfies
the objective of execution speed. Note however that the performance decreases linearly with
array size and utilisation. Real-time simulation of sizeable arrays with high-utilisation
configuration contexts is unfeasible on any simulation running on a microprocessor, as the potential
throughput of such arrays substantially exceeds the theoretical throughput of any
microprocessor.
The emulator is highly adaptable to different types of reconfigurable processors with different
functionality but similar control concepts, making it a good candidate for use in retargetable
toolchains for hardware/software co-design. The cell count and mix can be freely modified
at run-time,1 making the simulation quite generic. The functionality is extensible by allowing
new types of instruction cell or memory-mapped hardware to be expressed at a high-level using
C++. These become available after recompiling the emulator.
The accuracy of the proposed simulation can be broken down into two components: functional
accuracy, and timing accuracy. The functional accuracy is a function of how well the C++
description of each cell type matches the behaviour of the real cell. The serialisation algorithm
ensures that the data flow occurs in an analogous manner to that in the real data paths in the
target architecture. This guarantees that the state of the simulation matches that of the real
system at the transition between each configuration context iteration. This is the same level of
state coherence inherent in the design of the target architecture.
Similarly, timing accuracy comes by design: each configuration context specifies how many
master clock periods it should persist for. The step loading time can also be known in advance.
However, some hardware such as the shared data memory access arbiters introduce a dynamic
delay, depending on contention whilst executing. Such data-path dependent delays cannot
efficiently be modelled without sacrificing significant performance. This doesn't turn out to be
much of a problem, as internally pipelined cells are a more efficient way to deal with memory
dependencies, so arrays supporting these typically do not incur dynamic delays. Note that the
execution rate of the simulation is not a constant fraction of the real-life execution time on the
target architecture.
6.1.3 Emulation: Further Work
The serialisation algorithm could be applied directly to translation, allowing even faster
emulations to be performed. This should narrow the gap between the execution time of the emulation
vs. the original application code compiled natively for the microprocessor. This would be
achieved by using the output of the serialisation algorithm to generate static call lists for each
configuration context, which are then fed into an optimising linker (such as LLVM [60]) to
generate an optimised, native binary, eliminating the overhead of interpretation [99]. Note however
that a performance gap may still exist, due to the relative maturity of the optimisations available
in the host compiler vs. that of the target reconfigurable architecture.
1 Prior to loading the configuration contexts.
For more complex cell types, where some inputs of a given cell are combinatorial whilst other
inputs of the same cell are synchronous, the concept of disjoint cells no longer applies. The
serialisation algorithm in its current form is unable to deal with such cell types. This problem
can be resolved by making the disjoint property apply to each input individually (instead of
the output), and altering the serialisation algorithm accordingly. The serialisation algorithm
described in this thesis determines when a given operate action is ready to be added by looking at
the output of each operation feeding the inputs of the current operation. This would be modified
by serialising in reverse, allowing each input to be considered independently. Combinatorial
inputs would be sampled in the operate action, and synchronous (disjoint) inputs would be
sampled in the update action. Note that this requires that inputs can be sampled in the update
action. The update actions are serialised in arbitrary order, so to avoid corruption of the data
paths, it is important that the cell outputs are not modified in any update action. This imposes
a new design rule when implementing new instruction cell types.
Multi-core: Due to the execution rate being dependent on the data path complexity, each
configuration context may execute at a different fraction of real-time performance. As a result, the
simulation of multiple cores running different programs will not be able to consistently run at
maximum possible execution rate, without losing synchronisation with the other cores.2 This
means that performance doesn't necessarily scale well with the number of cores. However,
measures can be taken to minimise this loss of performance, by only synchronising the threads
at changes in state that are visible to the other cores.3
when each core is executed as a separate thread,




6.2.1 Scheduling: Problem Description
A complete toolchain for allowing a dynamically reconfigurable architecture to be programmed
from C must convert C source code into a set of configuration contexts for the target
architecture. The approach used here was to leverage existing re-targetable compiler technology to
compile the C code into an intermediate representation, which a separate tool (the scheduler)
then converts into configuration contexts.
The intermediate representation chosen here was a RISC-like assembly, consisting of
instructions grouped into basic blocks. Each instruction matches the functionality of one of the cell
types in the target architecture. Instruction operands are registers, which correspond to physical
registers in the target architecture.
Working from assembly has the benefit of allowing conventional re-targetable compiler
technology to be leveraged with little modification. However, this approach produces a serialisation
with very low parallel efficiency (suffering from the ILP wall [33]), making little use of the
available resources in a reconfigurable computing architecture. Parallelisation is performed by
the stand-alone scheduler tool.
Parallelisation involves reconstructing the data flow graph (DFG) from the assembly
instructions of each basic block, replacing the internal uses of registers with wires. Some of these wires
become registers again, if that connection has to span the boundary between different
configuration contexts, which will happen if there are insufficient resources to map all instructions into
a single configuration context. This mapping is performed by a scheduling algorithm.
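
The reconstruction step described above can be illustrated with a minimal sketch. It assumes a hypothetical three-address instruction format `(dest, opcode, [sources])` and simply connects each register read to the most recent write of that register within the same basic block; the instruction format and function name are illustrative, and the scheduler's actual data model is not reproduced here.

```python
def build_dfg(block):
    """Rebuild the data flow graph of one basic block.

    `block` is a list of (dest_reg, opcode, [source_regs]) in program order.
    Returns (wires, live_in, last_writer), where each wire is a tuple
    (producer_index, consumer_index, register).
    """
    last_writer = {}   # register -> index of the instruction that last wrote it
    wires = []         # internal register uses, now expressed as connections
    live_in = set()    # registers read before any write inside this block
    for idx, (dest, _opcode, sources) in enumerate(block):
        for reg in sources:
            if reg in last_writer:
                wires.append((last_writer[reg], idx, reg))
            else:
                live_in.add(reg)
        last_writer[dest] = idx
    return wires, live_in, last_writer


# Hypothetical block in which r3 is written twice.
block = [
    ("r3", "add", ["r1", "r2"]),   # 0
    ("r4", "mul", ["r3", "r1"]),   # 1
    ("r3", "sub", ["r4", "r2"]),   # 2
    ("r5", "add", ["r3", "r4"]),   # 3
]
wires, live_in, last_writer = build_dfg(block)
```

In this example the two wires labelled `r3` carry different values at different times. If a context boundary were to cut both of them, each would need its own physical register, which is the source of the register pressure discussed later in this section.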
Another down-side of working from the assembly is that high-level information is not available:
all information must be extracted from just the instructions. The assembly is only able to
express when a register receives a new value; it is not able to express when its current value
becomes irrelevant. The compiler's use of registers to pass information between the operands
of instructions inside a basic block means that an excessive number of registers are written to.
Parallelisation on a resource constrained core requires splitting data paths over multiple
configuration contexts, which requires an increased number of registers to store the values of the
connections that span each boundary. This increase is due to the fact that several of the broken
connections will correspond to the same register in the assembly, but at different times.4 These
must be assigned a separate physical register each. Register starvation would normally lead to
a failure to parallelise. In order to avoid register starvation, it is important to determine which
registers really store important information.




Correctness: The configuration contexts for the target architecture must exhibit the same
external state change behaviour as the original code (assembly).
Efficiency: The total execution time of the configuration contexts should be as low as possible.
6.2.1.2 Objectives
•  Devise a data model that can describe a wide range of target architectures, in a manner
that allows for easy static analysis.
•  Devise a series of algorithms that operate on this data model, to efficiently transform
basic blocks into valid configuration contexts.
Most of the work presented on the scheduler focuses on maximising this mapping efficiency, in
terms of number of configuration contexts produced, total critical path, throughput, and register
activity.
6.2.1.3 Novelty
A form of list scheduling was devised that focussed on packing data paths into as few steps
as possible. The side effect of this packing is to split more data paths across step boundaries,
which increases the demand for registers. For cores with a very limited number of registers, this
can lead to register starvation, which reduces parallel efficiency. A series of algorithms were
devised to avoid this (register starvation avoidance, section 4.10), with minimal impact on the
efficiency of the resulting schedule.
Furthermore, a series of optimisation and analysis passes were presented that improve the
scheduling efficiency (live register identification, section 4.7), and aid the routing tool to achieve
a better allocation over the whole program (global register reallocation, section 4.12).
This improves routability and reduces combinatorial delay, thus improving throughput.
6.2.2 Scheduling: Demonstrated Outcomes and Contribution to Knowledge
The results for applying the proposed tree follower scheduling algorithm (section 4.9 on page 87)
to examples that are constrained in terms of computation resources (section 4.13.1 on page 117)
show that the scheduling algorithm was always able to meet the theoretical minimum step count,
as was the design intent. However, despite minimising the step count, the tree follower scheduling
algorithm can produce sub-optimal schedules in terms of total critical path5. This is due to
the visitation order: when a particular resource is constrained, the choice of which uses of that
resource are scheduled in each step should depend on where these operations occur in terms
of position in the original data flow graph. To minimise total critical path, the operations that
map to the constrained resource type should be scheduled in ascending order of position in the
original data flow graph.
5 The sum of the critical paths of each step produced.
The tree follower does not ensure this visitation order; instead, it prefers to continue up a data
path arm, rather than move to others with similar geometry. If the arm contains more than
one operation of the constrained type in sequence, then precedence is given to these dependent
operations, instead of to the independent operations in other arms. The arms end up with less
overlap in the resulting schedule, which extends the critical path. A mobility-based scheduling
algorithm should exhibit different behaviour in this regard, preferring to schedule the
operations in order of their overall position in the data flow graph. Mobility-based list scheduling
algorithms are less computationally intensive than the tree follower algorithm proposed here.
Ideally the scheduling algorithm would have been compared to existing alternatives such as
mobility-based list scheduling [57]; however, this was not possible within the available time, as
the code base was built around the tree follower concept, and would need significant alteration.
The tree follower was originally intended to work around certain hardware constraints that were
difficult to deal with through list scheduling; however, these were later removed. The hardware
had evolved a lot since the previous published work on list scheduling, and the tool created
from that work was tailored towards a particular hardware design and selection of cell types.
This made the two tools impossible to compare directly. Such a comparison would be a useful
focus for future work.
Section 4.7 on page 75 proposed an algorithm for determining which registers named in the
assembly contain important data across the boundaries between basic blocks, by examining
the program control flow graph (CFG). The CFG is reconstructed from the assembly using
another algorithm proposed in section 4.6 on page 72. The effect of this is demonstrated in
section 4.13.2 on page 122, where it can be seen to increase the available register pool by many
times. More importantly, the advantage increases with core size (i.e. total number of registers).
For a given core size, this means that fewer (if any) registers need to be reserved for scratch,
which gives the compiler more room to produce larger basic blocks, which in turn are easier
to parallelise. It also frees up registers for use in assembly-level optimisations, such as the
conversion of stack-local variables (memory accesses) to registers.
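
To illustrate what identifying live registers across basic block boundaries involves, the sketch below uses the standard textbook backward liveness analysis, iterated to a fixed point over the CFG. It is a stand-in for, not a reproduction of, the algorithm of section 4.7; the block names, the `use`/`defs` sets and the function name are hypothetical.

```python
def live_registers(blocks, successors):
    """Textbook backward liveness analysis over a control flow graph.

    `blocks` maps a block name to (use, defs): registers read before being
    written in the block, and registers written in the block.
    `successors` maps a block name to the names of the blocks it can branch to.
    Returns (live_in, live_out), each mapping block name -> set of registers.
    """
    live_in = {name: set() for name in blocks}
    live_out = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in successors.get(name, [])))
            inn = use | (out - defs)
            if out != live_out[name] or inn != live_in[name]:
                live_out[name], live_in[name] = out, inn
                changed = True
    return live_in, live_out


# Hypothetical program: entry -> loop -> (loop | exit).
blocks = {
    "entry": (set(), {"r1", "r2"}),
    "loop": ({"r1"}, {"r1", "r3"}),
    "exit": ({"r1"}, set()),
}
successors = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = live_registers(blocks, successors)
```

Here only `r1` carries a value across any block boundary; `r2` and `r3` are written but never read by a later block, so in this simplified model they would not need to be preserved and could be returned to the register pool.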
The processing time and memory requirements of the live register identification algorithm scale
exponentially with program complexity. Image signal processing applications like the examples
in sections 4.13 and 5.8 take a few seconds to process, whilst programs of the size of the H.264
decoder can take a minute or more. This is acceptable for the domains targeted by the target
architecture, but does not bode well for desktop applications. However, this can be addressed
by breaking the program up into separate units that can be analysed individually (e.g. at the
function level or compilation unit level), which makes the processing time scale linearly with
the number of units. The boundaries between these units would be treated in the same manner
as if live register identification had not been done. This compromises the extent to which
registers are made available, but overall will still lead to a significant improvement compared
to not performing live register identification at all.
In a core that is both computation resource-starved and register-starved, the scheduling
algorithm was shown in section 4.13.3 on page 125 to produce a valid schedule in all cases, with
varying use of the three implemented forms of register starvation avoidance ('rewind',
'shuffle', and 'split'; section 4.10 on page 95). Even in the very extreme cases tested (with a 50%
reduction in available registers) the total critical path increased by a maximum of 17%, leading
to an 11% reduction in throughput. This combination of technologies makes it possible
for the scheduling algorithm to achieve significantly increased parallelism even in very highly
constrained cores, avoiding the need to revert to using the serialisation given in the assembly.
As the core size increases, the allocation of cells to physical resources becomes increasingly
significant, as the maximum possible distance between two arbitrarily placed cells increases.
Reallocation can be used to bring the end points of connections closer together. However, cells
that maintain internal state can only be reallocated globally. Section 4.12 on page 109
presented an algorithm for tracking information being passed between the steps of a program via
registers. This knowledge can be used to decouple registers between steps, allowing them to
be reallocated more freely. One particular use of this, called register renaming, was
demonstrated in section 4.13.4 on page 130 on a large core, resulting in a 30% reduction in average
path length and interconnect usage. This should allow more complex steps to be routed, and
improves the performance of simpler steps, by reducing the critical path.
6.2.3 Scheduling: Further Work
Improved scheduling algorithm: Implement a mobility-based scheduling algorithm, using the
register starvation avoidance methods described in this thesis. The rewind method would
alter the definition of the ready list according to previous attempts. The shuffle method would
rearrange the order of operations of equal standing in the ready list.
Scheduling of data: Automating the mapping of variables and arrays to registers and
non-uniform memory such as stream buffers and embedded register files [100]. This will require
changes to the compiler, since such information is not available from the assembly.
Alternative intermediate representation: Many of the issues of working from assembly can be
avoided by moving to higher-level internal representations inside the compiler, e.g. TreeSSA /
GIMPLE [101, 102] in GCC, or the LLVM intermediate representation [60].
Place and route: The output of the scheduler tool proposed in this work is in the form of an
abstract netlist; it describes only which cells are connected together, not how they are connected6.
This mapping is the task of a separate place and route (mapper) tool, which is outside the scope
of this thesis. Such a tool has been created [61].
Hardening: The abstract netlist can alternatively be used to derive a static connectivity map
(look-up table) for each active cell in the array, allowing a hardened core to be produced. This
avoids the area and combinatorial delay overhead of the reconfigurable interconnect, at the
expense of flexibility: the resulting array is only able to execute the single netlist (program)
that it was produced for. This provides an alternative C-to-gates tool flow, with an unusually
high degree of silicon re-use.




6.3.1 Pipelining: Problem Description
The high degree of operation chaining available in dynamically reconfigurable arrays has the
advantage of being able to break through the ILP wall [33], by allowing dependent operations
to be connected together via wires. This largely avoids the central register file common in
microprocessors and their derivatives, and thus avoids the bandwidth constraints imposed by it.
The down-side to chaining large sequences of dependent operations together is that the
combinatorial delay of the critical path increases, thus hurting throughput, particularly in single
configuration context loops (kernels), which are the most efficient way to perform heavy
computation.
Each instruction cell is idle until the output of each instruction cell on which it depends settles,
and is idle again once its output has settled. It remains idle until the next context iteration
begins. The longer the chain of dependent operations, the higher the fraction of execution
time the cell is idle for. Pipelining these chains of dependent operations allows the critical
path to be reduced, thus increasing the iteration rate and thus the throughput. The situation is
complicated by the fact that each instruction cell type has a different combinatorial delay, as do
the interconnect paths.
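
The growth of idle time with chain length can be made concrete with a small model. It assumes a single chain of cells evaluated once per context iteration, an iteration period equal to the sum of the cell delays, and no interconnect delay or step loading time; these simplifications and the function name are assumptions made for illustration only.

```python
def idle_fractions(delays):
    """Fraction of each iteration that every cell in a single chain spends idle.

    `delays` lists the combinatorial delay of each cell along the chain.
    A cell is busy only for its own delay; the iteration lasts for the sum.
    """
    period = sum(delays)
    return [1 - d / period for d in delays]


# Hypothetical chains of identical cells with a delay of 2 time units each.
short_chain = idle_fractions([2, 2])
long_chain = idle_fractions([2] * 10)
```

In this model each cell of the two-cell chain is idle for half of the iteration, whereas each cell of the ten-cell chain is idle for 90% of it. Splitting the long chain into pipeline stages shortens the period seen by every cell, which is the motivation given above.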
6.3.1.1 Aims
•  Automatically pipeline compute-intensive loops to significantly increase throughput.
6.3.1.2 Objectives
•  Automatic pipeline stage assignment, based on a user-supplied target critical path
constraint.
•  Minimal hardware changes.
•  Minimising the impact of pipelining on the context configuration size.
•  Minimising the impact of pipelining on the overall program size.
•  Automating the choice of target critical path.
6.3.2 Pipelining: Demonstrated Outcomes and Contribution to Knowledge
Structural-level pipelining techniques were applied via software to rapidly reconfigurable /
programmable architectures supporting operation-chaining, where complete kernels are mapped
into a single configuration context/cycle. This improves throughput by reducing the critical
path length of the looping kernel.
Furthermore, this work introduced the novel idea of achieving pipeline filling and flushing
through dynamic reconfiguration (multi-step pipelining), in a manner similar to that used in
software pipelining. This has the effect of simplifying the design of the pipelined kernel,
removing the need for additional control logic to be added to initialise the pipeline, thus requiring
no changes to the existing hardware. The addition of pipelining in this manner was shown to
increase the register requirement, and uses more program memory to store the additional
configuration contexts (prologue and epilogue). This approach however was found not to be very
scalable, as the overhead in terms of additional configuration contexts would quickly dominate
the program size. This makes multi-step pipelining only suitable to relatively shallow pipelines
(e.g. up to about 10 stages), which is appropriate for small cores with up to a few hundred cells.
An improvement on this (single-step pipelining), requiring some hardware changes, was
investigated. With additional constraints imposed on the pipeline stage assignment, it is possible
to strip away most of the additional configuration information, leaving a single configuration
context that performs all the pipeline execution phases: fill, loop, and flush. The only
overhead is a few bits of configuration data per configuration context, specifying the number of
pipeline stages that exist in that step. The hardware overhead is very small: an additional
counter (pipeline depth counter), plus a 2-bit state signal (execution phase) broadcast to the
instruction cells in the array that maintain state (except for registers). This typically represents
a very small fraction of the cells in the core. The novelty in this approach lies in the algorithmic
work-arounds that alter the pipeline geometry, in order to allow deep pipelining to be possible
with such minimal hardware additions. The limitation of this approach is an increase in pipeline
stage registers required to work around the changes in the pipeline geometry (i.e. operations
with side effects must be placed in the first or last pipeline stage, and extra cells are sometimes
needed to supply the initial value of certain kernel registers). The cell mix therefore has to
be chosen to ensure sufficient register availability for the given core size. By placing registers
in the interconnect instead of in cells, single-step pipelining becomes inherently scalable, as
registers will always be available along each connection in the data flow graph when mapped
to the array, and their availability scales with the length of the connection (which itself largely
determines the number of pipeline stages needed along the connection).
In both approaches, the potential throughput is limited by the number of registers available for
use in connecting the pipeline stages, and by the presence of feedback loops that demand single
cycle latency (e.g. when updating the value of a register).
The pipelining algorithms were applied to a simple demosaic filter for a variety of target
throughput constraints, and achieved a maximum throughput of nearly ten times that of the
original kernel. The technique was shown to also improve the throughput for applications where
the kernel performs only a small number of iterations, particularly when using the single-step
pipelining approach, which eliminates the step loading time for each fill and flush step. A two
pass 8x8 2-D DCT was used to demonstrate this: the 8 element 1-D DCT kernel performs
only 8 iterations in each pass (rows and columns), and yet a speed-up of 35% was shown to be
possible (section 5.8.1 on page 165).
In general, pipelining is able to take a kernel of arbitrary critical path, and reduce it to the
minimum critical path determined by non-pipelineable data paths (which is a constant for a
given core). The maximum achievable speed-up therefore scales with kernel complexity, so
long as sufficient registers are available, and so long as the kernel performs sufficient iterations
over which to amortise the impact of the additional fill and flush iterations.
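The amortisation argument above can be made concrete with a small timing model. The following is a minimal sketch under stated assumptions, not the model used in the thesis: it assumes that a kernel pipelined into a given number of stages needs (iterations + stages - 1) steps in total, that every step takes the pipelined critical path, and that configuration loading time is ignored. The function name and parameters are hypothetical.

```c
/*
 * Hypothetical speed-up model for a pipelined kernel (an illustrative
 * assumption, not taken from the thesis):
 *   unpipelined time = n_iter * t_orig_ns
 *   pipelined time   = (n_iter + stages - 1) * t_stage_ns
 * where the extra (stages - 1) steps account for filling and flushing.
 * Returns 0.0 for degenerate inputs.
 */
double pipelined_speedup(unsigned n_iter, unsigned stages,
                         double t_orig_ns, double t_stage_ns)
{
    if (n_iter == 0 || stages == 0 || t_stage_ns <= 0.0)
        return 0.0;

    double t_before = (double)n_iter * t_orig_ns;
    double t_after  = (double)(n_iter + stages - 1) * t_stage_ns;

    return t_before / t_after;
}
```

Under this model, a 4-stage pipeline that cuts a 40ns critical path to 10ns approaches a speed-up of 4 only for long-running kernels; with just 8 iterations the fill and flush steps limit it to roughly 2.9.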
The pipelining algorithm also allows for the use of internally pipelined cells, which are useful
for reducing the critical path of a pipelined kernel, and for making combinatorial operations
synchronous. A common use of this is to hide memory latency. This was demonstrated with
a gamma correction module which performs table look-ups requiring very high memory
bandwidth. Pipelining was shown to increase the throughput by up to nearly six times
(42MPixels/s → 242MPixels/s) using combinatorial memory reads, and nine times
(42MPixels/s → 379MPixels/s) using internally pipelined memory reads (section 5.8.2 on page 174).
This work concentrated on reconfigurable instruction cell processors, which support a high
degree of operation chaining. However, the same techniques could be applied to other quite
different architectures that support operation chaining, such as upcoming VLIW/ULIW
processors [31].
Furthermore, this work proposed an algorithm for automatically applying dynamic structural-level
pipelining to single configuration context kernels running on dynamically reconfigurable
arrays (DRAs). The technique is a form of feedback directed optimisation, where profiling
information (consecutive execution counts) is used to determine which kernels will benefit
from pipelining. Candidates with very low consecutive execution counts must not be pipelined
too deeply. This is to ensure that the additional latency of pipeline filling and flushing is more
than offset by the decrease in total execution time for the pipelined kernel loop when the
pipeline is full. This is only possible when the minimum possible iteration count is known.
This is the case for pixel-level kernels in the ISP application domain, as the iteration count is
typically the line size of the image.
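The condition on consecutive execution counts can be expressed as a break-even test. The sketch below is an illustration under assumptions that are not taken from the thesis: it assumes the pipelined kernel needs (iterations + stages - 1) steps of the pipelined critical path, compares that against the unpipelined time, and ignores configuration loading. The function name is hypothetical.

```c
/*
 * Hypothetical break-even test: returns the smallest iteration count N for
 * which pipelining is strictly faster, i.e.
 *     (N + stages - 1) * t_stage_ns < N * t_orig_ns
 * which rearranges to
 *     N > (stages - 1) * t_stage_ns / (t_orig_ns - t_stage_ns).
 * Returns 0 when pipelining can never pay off (no extra stages, or the
 * pipelined step is not shorter than the original critical path).
 */
unsigned long break_even_iterations(unsigned stages,
                                    double t_orig_ns, double t_stage_ns)
{
    if (stages < 2 || t_stage_ns <= 0.0 || t_stage_ns >= t_orig_ns)
        return 0;

    double bound = (double)(stages - 1) * t_stage_ns
                 / (t_orig_ns - t_stage_ns);

    /* Smallest integer strictly greater than a non-negative bound. */
    return (unsigned long)bound + 1;
}
```

A profiling-driven tool could compare the minimum known iteration count (for example the image line size) against this value for each candidate pipeline depth, and reject depths whose break-even point is not reached.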
An iterative approach is used to form an efficient pipeline, where the timing constraint is
automatically chosen to be an integer multiple of the master clock frequency. The timing constraint
is incremented until a valid pipeline can be constructed without encountering register
starvation. The range of possible pipeline geometries is controlled by the availability of registers.
Architectures with distributed registers will offer the best results, otherwise the bandwidth of
the interface and/or additional combinatorial delays introduced by routing to and from a register
file would likely outweigh any benefit. This makes the case for registers to be made available
in the interconnect itself.
The algorithm was applied to a demosaic kernel of modest complexity and to a software
division algorithm, both of which can be pipelined to a significant depth. A performance
increase of up to 7 times can be obtained for the demosaic example, and nearly 10 times for
the division (section 5.8.3 on page 179). As the pipeline gets deeper, the cost (in terms of
register requirement and storage for pipeline filling and flushing contexts) increases more than
linearly. As the critical path of the pipelined kernel gets smaller, the quantisation of the iteration
rate caused by the master clock gets increasingly worse. Inside the bounds of this quantisation,
reducing the pipeline critical path⁷ has no effect on the iteration rate. In these situations, extra
⁷ By increasing the number of pipeline stages.
resources would be introduced for no benefit. To avoid this, the proposed algorithm relaxes the
critical path to take into account this quantisation, thus minimising the resource requirements
for a given physically achievable iteration rate.
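The effect of clock quantisation on the critical path target can be sketched as follows. This is an illustrative assumption about how the relaxation might be computed, not the thesis implementation: the iteration period is taken to be the critical path rounded up to a whole number of master clock periods, and any target below that rounded value is relaxed up to it. Times are held as integer picoseconds to avoid floating-point rounding; the function name is hypothetical.

```c
/*
 * Hypothetical relaxation of a critical path target to the master clock:
 * returns the longest critical path (in picoseconds) that still achieves
 * the same iteration period as the requested target, i.e. the requested
 * target rounded up to a whole number of clock periods (at least one).
 * Returns 0 if the clock period is invalid.
 */
unsigned long relax_target_to_clock_ps(unsigned long target_ps,
                                       unsigned long clock_period_ps)
{
    if (clock_period_ps == 0)
        return 0;

    /* Ceiling division: number of clock cycles needed per iteration. */
    unsigned long cycles = (target_ps + clock_period_ps - 1) / clock_period_ps;
    if (cycles == 0)
        cycles = 1;

    return cycles * clock_period_ps;
}
```

With a 2ns master clock, targets of 4.1ns, 5.2ns and 6ns all relax to 6ns: each would yield one iteration every three clock cycles, so pipelining to the tighter targets would only consume extra registers.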
6.3.3 Pipelining: Further Work
Re-timing: To allow infinite impulse response filters to be pipelined (to some extent),
registers that form the feedback loop(s) must be re-used as pipeline stage registers. This could be
modelled by adding constraints between operations on either side of a feedback register forcing
them to be one pipeline stage apart, then modifying the pipeline stage register assignment pass
to use the existing registers. This is similar to the modifications that were added to support
internally pipelined cells.
Post-routing pipelining: As with the rest of the work presented in this thesis, pipelining is
performed on abstract netlists, where the path length information is not yet known. As a result,
pipelining is performed prior to routing, based on a uniform path length estimate. Although
this certainly leads to higher real-life throughputs, it may not be optimal, due to variance in
the path lengths. This problem increases with array size. To compensate for this, large arrays
are proposed to have registers available in the interconnect (sboxes). When this is available,
the alternative approach is to perform pipelining after routing. This involves using the existing
pipeline stage assignment algorithm but using the real path length information. Pipeline stage
register assignment is different: the pipeline stage information is used to determine how many
delay elements (registers) to enable along each path, and a new algorithm decides which
registers along that path should be enabled, in order to best meet the target critical path length.
Hints must be provided to the routing tool to ensure that certain paths are long enough (e.g.
those leading to output registers), and others (e.g. non-pipelineable data paths) are as short as
possible. For best results, multiple routing/pipelining iterations might be necessary. This has




The work presented in this thesis has been incorporated into a commercial tool set for the
RICA architecture. The combination of proposed tools and algorithms provides a high degree
of design automation, going from high-level C code to a given target architecture. Design
space exploration currently remains a manual task, but the flexibility of the tools makes this
quite easy. Cores of varying sizes and resource mix can be tested with just a change in a text
file, the Machine Description File (MDF), describing those parameters. A resulting
configuration and performance metrics can be obtained by simply re-running the tools. The addition
of new functionality (e.g. new cell types or hardware) is more involved, requiring that the
compiler be modified to support the new instruction, and the emulator be modified to support
the functionality of that new resource.
The design premise of going from C to reconfigurable hardware is to allow easier
programming of high-throughput architectures. The pure von Neumann programming model (where
algorithms are expressed imperatively, operating on a single shared data memory) offers the
simplest way of describing algorithms. However, to get maximum performance on large
arrays, some degree of customisation of the C code is required, straying away from this model.
This is because the current compiler is unable to deal with non-uniform memory architectures
(NUMA). The scheduler is currently able to work around this to some extent, by converting
stack local variables to registers, where possible. However, without high-level information
defining how the various variables and arrays are used in a program, the scheduler is unable
to perform more complex transforms such as mapping arrays to stream buffers. There is,
however, encouraging research in these directions [103], especially within the LLVM community
[104], so this should be possible to provide in the near future. However, there is only so much







* Created by Mark Muir on 2008-06-24
* Example program with a single-step kernel demonstrating several independent
* chains of dependent operations running in parallel. This is to aid in the









    for (i=0; i<DATASET_SIZE; i++)
        input[i] = i;
    for (j=0; j<REPEAT; j++)
    {
        /* Perform a set of operations in parallel. */
        for (i=0; i<DATASET_SIZE; i+=4)
        {
            int result[4];
            result[0] = (input[i+0] & i) * (i<<1);
            result[1] = (input[i+1] & i) * (i<<2);
            result[2] = (input[i+2] & i) * (i<<3);
            result[3] = (input[i+3] & i) * (i<<4);
            /* Write the results after reading all values, to avoid memory
               access conflicts. */
            output[i+0] = result[0];
            output[i+1] = result[1];




Figure A.1: C source code for the example with four copies of the data path executing
independently, in parallel.
Emulator Test Programs
Step[‘L.Parallel_s_L5’(6)]
(critical path: 834ns)
Figure A.2: Data flow graph for the configuration context corresponding to the main loop (kernel)
of the ‘parallel’ example program shown in figure A.1. Generated by the RICA tools.




* Created by Mark Muir on 2008-06-24
*
* Example program with a single-step kernel demonstrating long chains of
* dependent operations. This is to aid in the comparison between the









for (i=0; i<DATASET_SIZE; i++) 
input[i] = i;





        /* Perform a set of operations with similar cell resource
           requirements, in series. To keep the memory activity (WMEM
           count) the same as the Parallel example, the same value is
           written to four consecutive addresses. */
        for (i=0; i<DATASET_SIZE; i+=4)
        {
            int result[4];
            result[0] = ((input[i+0] & i) << i);
            result[1] = ((input[i+1] & i) << i);
            result[2] = ((input[i+2] & i) << i);
            result[3] = ((input[i+3] & i) << i);
            /* Pass the temporary results through memory, to prevent
               the compiler from re-ordering the multiplication a*b*c*d
               to (a*b)*(c*d) instead of the desired ((a*b)*c)*d. */
            output[i+0] = result[0] * i;
            output[i+1] = result[1] * output[i+0];
            output[i+2] = result[2] * output[i+1];
            output[i+3] = result[3] * output[i+2];
        }
    return 0;
}
Figure A.3: C source code for the example with two copies of the data path executing indepen­
dently in parallel, with another two copies of the data path dependent on these (thus 
extending the critical path).





Figure A.4: Data flow graph for the configuration context corresponding to the main loop (kernel) 
of the ‘combinatorial’ example program shown in figure A.3. Generated by the RICA 
tools.




* Created by Mark Muir on 2008-06-24
*
* Example program with a single-step kernel demonstrating the same independent
* chains of dependent operations as the Parallel example, but with the
* independent chains placed in different steps (by being in different loop
* iterations). This is to aid in the comparison between the execution times of









    for (i=0; i<DATASET_SIZE; i++)
        input[i] = i;
    for (j=0; j<REPEAT; j++)
    {
        /* Perform an arbitrary operation on each member of the data set. The
           results of the operation will be different to the Parallel example,
           since the value of 'i' will be different in some iterations.
           Compensating for this would change the complexity of the operation. */
        for (i=0; i<DATASET_SIZE; i++)
        {





Figure A.5: C source code for the example with the data path executed inside a loop (which hasn’t 
been unrolled), causing the main loop to consist of four iterations of the same config­
uration context executing in sequence.
Step[LSequential_s_L5’(6)] 
(critical path: 8.34ns)
Figure A.6: Data flow graph for the configuration context corresponding to the main loop (kernel) 




Live Register Identification Algorithm
Trace
CFG edge        | Action              | Registers live on exit from caller
NULL → _reset   | visit _main         | N/A
_reset → _main  | visit L1            | -
_main → L1      | visit _func         | -
L1 → _func      | visit L4            | -
_func → L4      | visit L4            | -
L4 → L4         | visit L4            | -
L4 → L4         | already considered  | - + r3,r4,r6 + -
                | ⇒ update and return | = r3, r4, r6
L4 → L4         | visit ret2          | r3, r4, r6
L4 → ret2       | visit L2            | r3, r4, r6
ret2 → L2       | visit _func         | -
L2 → _func      | visit L4            | -
_func → L4      | already considered  | - + r3,r4,r6 + r3,r4,r6
                | ⇒ update and return | = r3, r4, r6
L2 → _func      | all targets visited | - + r1,r2,r6 + r3,r4,r6
                | ⇒ update and return | = r1, r2, r3, r4, r6
ret2 → L2       | all targets visited | - + r5,r11 + r1,r2,r6
                | ⇒ update and return | = r1, r2, r5, r6, r11
L4 → ret2       | visit L3            | r3, r4, r6
ret2 → L3       | visit ret1          | r5, r11
L3 → ret1       | visit _end          | -
ret1 → _end     | no targets          |
                | ⇒ update and return |
L3 → ret1       | all targets visited | - + r1,r9 + -
                | ⇒ update and return | = r1, r9
ret2 → L3       | visit L1            | r5, r11
L3 → L1         | visit _func         | r1, r9
L1 → _func      | already considered  | - + r1,r2,r6 + r3,r4,r6
                | ⇒ update and return | = r1, r2, r3, r4, r6
L3 → L1         | all targets visited | r1, r9 + r5 + r1,r2,r6
                | ⇒ update and return | = r1, r2, r5, r6, r9
ret2 → L3       | all targets visited | r5, r11 + r5,r11 + r1,r2,r5,r9
                | ⇒ update and return | = r1, r2, r5, r9, r11
L4 → ret2       | all targets visited | r3, r4, r6 + r1,r3,r9 + r1,r5,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
L4 → L4         | all targets visited | r1, r3, r4, r5, r6, r9 + r3,r4,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
_func → L4      | all targets visited | r3, r4, r6 + r3,r4,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
L1 → _func      | all targets visited | r1, r2, r3, r4, r6 + r1,r2,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r2, r3, r4, r5, r6, r9
_main → L1      | all targets visited | - + r5 + r1,r2,r5,r6
                | ⇒ update and return | = r1, r2, r5, r6
_reset → _main  | all targets visited | - + r1,r2,r9 + r1,r2,r6
                | ⇒ update and return | = r1, r2, r6, r9
NULL → _reset   | all targets visited | N/A
                | ⇒ update and return |
information changed ⇒ traverse again
Continued in table B.2 ...
Table B.1: Trace of the CFG walk for the example in figure 4.18 on page 77, using information in
table 4.4 on page 79. The ‘update’ of the record of registers live on exit from the caller
(LHS) consists of adding all the input registers of the callee (RHS), plus any registers
live on exit from the callee (RHS) that weren't clobbered in the callee (RHS). Newly
added registers resulting from the update are shown in bold.
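The ‘update’ step described in the caption can be sketched with register sets held as bit masks. This is a minimal illustration, not the thesis implementation: the type names, the `REG` macro, the structure layout and the 32-register limit are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical representation: bit n set means register rn is in the set. */
typedef uint32_t regset_t;
#define REG(n) ((regset_t)1u << (n))

/* Hypothetical per-block summary of the callee (cf. table 4.4). */
typedef struct {
    regset_t inputs;        /* registers read by the callee            */
    regset_t clobbered;     /* registers overwritten by the callee     */
    regset_t live_on_exit;  /* registers live on exit from the callee  */
} block_info_t;

/*
 * Adds to the caller's live-on-exit record all input registers of the
 * callee, plus any registers live on exit from the callee that were not
 * clobbered in the callee. Returns 1 if the record changed (so another
 * traversal of the CFG is needed), 0 otherwise.
 */
int update_live_on_exit(regset_t *caller_live, const block_info_t *callee)
{
    regset_t passed_through = callee->live_on_exit & ~callee->clobbered;
    regset_t updated = *caller_live | callee->inputs | passed_through;
    int changed = (updated != *caller_live);

    *caller_live = updated;
    return changed;
}
```

The returned flag corresponds to the "information changed ⇒ traverse again" decision at the end of the trace: the walk would repeat until a full traversal leaves every record unchanged.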
CFG edge        | Action              | Registers live on exit from caller
... Continued from table B.1
NULL → _reset   | visit _main         | N/A
_reset → _main  | visit L1            | r1, r2, r6, r9
_main → L1      | visit _func         | r1, r2, r5, r6
L1 → _func      | visit L4            | r1, r2, r3, r4, r5, r6, r9
_func → L4      | visit L4            | r1, r3, r4, r5, r6, r9
L4 → L4         | visit L4            | r1, r3, r4, r5, r6, r9
L4 → L4         | already considered  | r1, r3, r4, r5, r6, r9 + r3,r4,r6 + r1,r4,r5,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
L4 → L4         | visit ret2          | r1, r3, r4, r5, r6, r9
L4 → ret2       | visit L2            | r1, r3, r4, r5, r6, r9
ret2 → L2       | visit _func         | r1, r2, r5, r9, r11
L2 → _func      | visit L4            | r1, r2, r3, r4, r6
_func → L4      | already considered  | r1, r3, r4, r5, r6, r9 + r3,r4,r6 + r1,r3,r5,r6,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
L2 → _func      | all targets visited | r1, r2, r3, r4, r6 + r1,r2,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r2, r3, r4, r6, r9
ret2 → L2       | all targets visited | r1, r2, r5, r9, r11 + r5,r11 + r1,r2,r6
                | ⇒ update and return | = r1, r2, r5, r6, r9, r11
L4 → ret2       | visit L3            | r1, r3, r4, r5, r6, r9
ret2 → L3       | visit ret1          | r1, r2, r5, r6, r9, r11
L3 → ret1       | visit _end          | r1, r2, r5, r6, r9
ret1 → _end     | no targets          | - + - + -
                | ⇒ update and return | = -
L3 → ret1       | all targets visited | r1, r2, r5, r6, r9 + r1,r9 + -
                | ⇒ update and return | = r1, r2, r5, r6, r9
ret2 → L3       | visit L1            | r1, r2, r5, r6, r9, r11
L3 → L1         | visit _func         | r1, r2, r5, r6, r9
L1 → _func      | already considered  | r1, r2, r3, r4, r5, r6, r9 + r1,r2,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r2, r3, r4, r5, r6, r9
L3 → L1         | all targets visited | r1, r2, r5, r6, r9 + r5 + r1,r2,r5,r6
                | ⇒ update and return | = r1, r2, r5, r6, r9
ret2 → L3       | all targets visited | r1, r2, r5, r6, r9, r11 + r5,r11 + r1,r2,r5,r9
                | ⇒ update and return | = r1, r2, r5, r6, r9, r11
L4 → ret2       | all targets visited | r1, r3, r4, r5, r6, r9 + r1,r3,r9 + r1,r5,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
L4 → L4         | all targets visited | r1, r3, r4, r5, r6, r9 + r3,r4,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
_func → L4      | all targets visited | r1, r3, r4, r5, r6, r9 + r3,r4,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r3, r4, r5, r6, r9
L1 → _func      | all targets visited | r1, r2, r3, r4, r5, r6, r9 + r1,r2,r6 + r1,r3,r4,r5,r6,r9
                | ⇒ update and return | = r1, r2, r3, r4, r5, r6, r9
_main → L1      | all targets visited | r1, r2, r5, r6 + r5 + r1,r2,r5,r6
                | ⇒ update and return | = r1, r2, r5, r6
_reset → _main  | all targets visited | r1, r2, r6, r9 + r1,r2,r9 + r1,r2,r6
                | ⇒ update and return | = r1, r2, r6, r9
NULL → _reset   | all targets visited | N/A
                | ⇒ update and return |
information changed ⇒ traverse again
no new information ⇒ end





Figure C.1: Data flow graph for the configuration context corresponding to the main loop (kernel)
of the 3x3 demosaic module used in section 5.8.1 on page 165. Generated by the RICA
tools. Not pipelined. The critical path is shown by a red outline.
Pipelining Test Programs
Figure C.2: Data flow graph for the configuration context corresponding to the main loop (kernel)
of the 3x3 demosaic module used in section 5.8.1 on page 165. Generated by the RICA
tools. Maximally pipelined (target 5ns, single-step pipelining). Pipeline stage registers
are shown with a pink fill. The critical paths are shown by a red outline (clearer in
the zoomed-in view). Two adjacent pipeline stages share the same critical path; both
consist of a multiplier between pipeline stage registers.
Figure C.3: Data flow graph for the configuration context corresponding to the main loop (kernel) 
of the DCT example used in section 5.8.1 on page 165. Generated by the RICA tools. 
Not pipelined. The critical path is shown by a red outline.
Figure C.4: Data flow graph for the configuration context corresponding to the main loop (kernel) 
of the DCT example used in section 5.8.1 on page 165. Generated by the RICA tools. 
Maximally pipelined (target 7ns, single-step pipelining). Pipeline stage registers are 
shown with a pink fill. The critical path is shown by a red outline.
Figure C.5: Data flow graph for the configuration context corresponding to the main loop (kernel) 
of the gamma correction module used in section 5.8.2 on page 174. Generated by the 
RICA tools. This version uses combinatorial memory reads (RMEM) for each of the 
12 table look-ups. Not pipelined. The critical path is shown by a red outline.
Figure C.6: Data flow graph for the configuration context corresponding to the main loop (kernel) 
of the gamma correction module used in section 5.8.2 on page 174. Generated by the 
RICA tools. This version uses combinatorial memory reads (RMEM) for each of the 
12 table look-ups. Maximally pipelined (target 6ns). Pipeline registers are shown with 
a pink fill. The critical paths are shown by a red outline. Note that in this case two 
adjacent pipeline stages happen to have equal critical paths, so the critical path looks 
twice as long as it really is.
Step[‘L tb_gamma_2LUTbased_s_L543 Step2’(7)]
(critical path: 0.25ns)
Step[‘L tb_gamma_2LUTbased_s_L543 Step3’(8)]
(critical path: 0.25ns)
Figure C.7: Data flow graphs for the configuration contexts corresponding to the main loop of the
gamma correction module used in section 5.8.2 on page 174. Generated by the RICA
tools. This version uses internally pipelined (3 stages) memory reads (SRBUF_RAM)
for each of the 12 table look-ups. Not pipelined. The loop has been broken into 4 steps
to work around the internal pipelining of the cells, where two of the steps have no
connections (just the sbuf cells being active, allowing data to propagate internally).
Figure C.8: Data flow graph for the configuration context corresponding to the main loop (ker­
nel) of the gamma correction module used in section 5.8.2 on page 174. Generated 
by the RICA tools. This version uses internally pipelined (3 stages) memory reads 
(SRBUF_RAM) for each of the 12 table look-ups. Maximally pipelined (target 5ns). 







Automated Dynamic Throughput-constrained Structural-level Pipelining in
Streaming Applications
Mark Muir (1), Tughrul Arslan, Iain Lindsay
(1) The University of Edinburgh
Mayfield Road, Edinburgh, EH9 3JL
United Kingdom
Mark.Muir@ed.ac.uk
(2) Institute for System Level Integration
Alba Centre, Livingston, EH54 7EG
United Kingdom
Abstract
Stream processing applications such as image signal
processing demand high throughput. However, customers
increasingly demand runtime flexibility in their designs,
which cannot be provided by custom ASIC solutions.
Currently, reconfigurable processors tend to offer insufficient
throughput for widespread use in streaming applications.
This paper demonstrates how structural-level pipelining
techniques can be applied to rapidly dynamically
reconfigurable computing architectures, in order to increase
throughput. This is done by automatically inserting registers
into the data path of performance critical code sections
that have already been optimised into a single configuration
context. A new algorithm is presented to choose the
insertion point of pipeline stage registers in order to meet
a specified throughput whilst minimising register resource
usage. The paper then demonstrates a new approach where
properties of dynamic reconfiguration can be utilised to
perform the tasks of pipeline stage initialisation and flushing.
The technique is demonstrated on a real-life application:
the demosaic filter in a standard image signal processing
pipe used in modern digital cameras, and can be seen to
boost the throughput from K MPixels/s to 51MPixels/s on
an example reconfigurable processor.
1. Introduction
The choice of platform for many modern digital signal
processing tasks in embedded systems is often limited
to application-specific integrated circuits (ASICs), since
off-the-shelf programmable architectures such as DSPs
and microprocessors cannot meet the throughput
requirements, whereas reconfigurable hardware such as
field-programmable gate arrays (FPGAs) requires too much area
and power. However, for applications that demand an
element of reprogrammability, streaming processors (such as
those offered by Ambric [1] and SPI [2]) are becoming an
increasingly attractive solution, which improve on
throughput by providing multiple processing elements/cores with
an interconnect structure suited to streaming. However,
these processing elements (usually based on regular DSP
designs) often equate to significant silicon area.
Coarse-grained dynamically reconfigurable architectures (DRAs)
offer a high degree of parallelism, sufficient to achieve high
throughput [3][4]. Thus fewer cores are required for a given
application, leading to a much lower area overhead. These
coarse-grained architectures are reconfigured very rapidly
(e.g. millions of times per second), in order to achieve
control flow similar to a regular microprocessor. This paper
focuses on maximising the performance of programs running
on a single core. However, the techniques can be directly
applied to programs running on additional cores in a
complete streaming application.
Coarse-grained DRAs, such as instruction cell based
computing architectures [5][6], provide a high degree of
instruction chaining inside the core, by allowing arbitrary
connections to be made between the various functional units
via a configurable routing network. This allows quite
complex data paths to be rendered onto the fabric and executed
in a single configuration. This makes these architectures
particularly suitable to stream processing, as fewer fetches
from program memory are required. Performance is
optimised by attempting to match the size of each kernel
(inner loops where most of the execution time is spent) to the
available resources, allowing them to fit into a single
configuration context. This allows the configuration to persist
for many clock cycles, operating on new data on each
cycle. This increases throughput, since no time is spent
having to reconfigure the core between successive iterations.
It also decreases power consumption, as the configuration
only needs to be fetched from program memory (or cache)
once, upon first entering the kernel, rather than on every
iteration. However, the resulting data paths can often have a
long critical path, leading to poor temporal utilisation of the
functional units, since they have to wait until all functional
units have completed before operating on the next batch of
data, which limits the throughput.
Pipelining provides a way of starting to operate on a new
batch of data before an old one has completed, so that the
functional units of multiple stages of the kernel can be
active concurrently, each operating on a different batch of
data. This paper describes how structural-level pipelining
can be applied dynamically to architectures that support a
high degree of operation chaining. This is done as part of
the configuration, i.e. pipelines tailored to the particular
kernel are rendered onto the core at run-time. This has the
same effect as adding pipelining in hardware, but can be
changed at run-time. Furthermore, these custom pipelines
can be initialised and flushed in separate configuration
contexts, reducing the resource requirement of the pipelined
kernel.
Section 2 reviews existing pipelining techniques, and
relevant software optimisation techniques. Section 3.1
describes an algorithm to perform pipeline stage allocation,
and section 3.2 shows how properties of dynamic
reconfiguration can be used to fill and flush the resulting pipeline.
Section 4 shows the result of applying this technique to a
real-life kernel used in image processing.
2. Previous work
For architectures that support instruction chaining,
scheduling involves mapping as many dependent and
independent data paths into as few configuration contexts as
possible [7]. Independent data paths run in parallel, so the
time for which a configuration persists is determined by the
maximum critical path length of these data paths. If
sufficient functional unit resources are available, loops can be
optimised by loop unrolling [8], i.e. placing multiple
iterations as independent data paths in the same
configuration. This allows multiple iterations to begin and end at
once. This does not change the original critical path length,
yet can increase the throughput. The throughput is
determined by the critical path length of a loop iteration and the
number of iterations that can be performed at once.
During each execution of the loop configuration context, data
propagates through the operation chains until the final
result is ready. This means that the functional units involved
in that chain are only performing useful work for a
fraction of the time. This is where structural-level pipelining of
these data paths comes in: to artificially reduce the
critical path length by allowing new iterations to begin without
waiting for the completion of previous iterations.
Various approaches of pipelining data paths have been
proposed [9]. These require that the designer specifies a
throughput constraint, in order to allow the algorithm to best
make the choice between throughput and the area overhead
each pipeline stage introduces. These approaches describe
various algorithms for the task of pipeline stage allocation,
applied to a number of different levels in a design. On
reconfigurable architectures such as FPGAs, custom pipelines
can be rendered as part of the configuration, leading to
significant increases in throughput [10].
Performing this pipelining dynamically as part of the
configuration allows the throughput of a given DRA core
to be increased, without reducing its flexibility. A generic
stream processing engine built from these cores would
therefore be able to achieve much higher throughputs over a
wide range of streaming applications. For a given
throughput requirement, fewer cores are required with this
approach, which reduces the area and also the complexity of
application development.
3. Dynamic pipelining
Conventional structural-level pipelining can be applied
to single configuration context kernels with long
critical data paths, in order to reduce the critical path, and
thus increase throughput. This is done as part of the
configuration, i.e. pipelines tailored to the particular
kernel are rendered onto the core at runtime. This is done using
existing register resources in the core to delay values for a
single execution cycle, allowing values to be bridged across
pipeline stage boundaries.
Structural pipelining is applied to the kernel basic block
by first assigning each operation in the original data flow
graph to a pipeline stage. Then, registers are introduced to
store values over boundaries between pipeline stages.
Figure 1 shows an example kernel before and after
structural-level pipelining.
3.1. Pipeline stage allocation
First, constraints are defined between operations, where
the order of execution is important. Examples include
‘same stage or earlier’ constraints between operations
reading from input registers and operations that have those same
registers marked as global output registers, and ‘same stage
or earlier’ constraints between data memory read operations
and potentially aliasing data memory write operations. All
operations in a feedback chain must be placed in the same
pipeline stage, since such chains require single-step total
latency in order to keep the pipeline full.
The algorithm is a form of list scheduling. Only
operations whose predecessors (in the data path) have already
been assigned a pipeline stage may be considered for
insertion on each pass. In order to minimise the register count,
operations should be placed in as late a pipeline stage as
possible. Operations that must be placed in the same stage
are dealt with together. Operations are considered for
placement in the latest pipeline stage containing any of their
predecessors. Then, the insertion point is moved towards
later pipeline stages until all constraints have been satisfied.
Once a valid insertion point has been identified, the critical
path is calculated for the resulting (incomplete)
configuration context with the operation in that pipeline stage. If the




Figure 1: Example kernel data flow graph, (a) before pipelining, (b) after pipelining (kernel loop context). The inserted pipeline stage registers are
shown in red. The per-cycle critical path is shown in bold, and is shorter in (b), which allows for a higher throughput.
in that pipeline stage. Otherwise, the operation is added to
the next pipeline stage (creating it if it does not exist).
Once the pipeline stages have been determined, pipeline
stage registers are assigned as follows: for each pipeline
stage in sequence, a new register is assigned to store each value
produced by an operation in a previous pipeline stage
that needs to be kept for use in this or any later stage.
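The allocation and register assignment steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ordering constraints and feedback chains are left out, the acceptance test (an operation stays in a stage only if the path through it does not exceed the target) is an assumption, and the names `allocate_stages`, `count_stage_registers` and the example operations are hypothetical.

```python
def allocate_stages(ops, target_ns):
    """ops maps name -> (delay_ns, [predecessor names]). Returns name -> stage."""
    stage, arrival = {}, {}
    pending = dict(ops)
    while pending:
        # List scheduling: only operations whose predecessors are placed are ready.
        ready = [n for n, (_, preds) in pending.items()
                 if all(p in stage for p in preds)]
        if not ready:
            raise ValueError("cycle in data flow graph")
        for name in ready:
            delay, preds = pending.pop(name)
            # Start in the latest stage that contains any predecessor.
            s = max((stage[p] for p in preds), default=0)
            start = max((arrival[p] for p in preds if stage[p] == s), default=0.0)
            # Assumed acceptance test: move on if the target would be exceeded.
            if start + delay > target_ns:
                s, start = s + 1, 0.0
            stage[name], arrival[name] = s, start + delay
    return stage


def count_stage_registers(ops, stage):
    """One register per stage boundary crossed between producer and last consumer."""
    last_use = {}
    for name, (_, preds) in ops.items():
        for p in preds:
            last_use[p] = max(last_use.get(p, stage[p]), stage[name])
    return sum(last_use[p] - stage[p] for p in last_use)


example_ops = {
    "a": (5, []),
    "b": (5, []),
    "mul": (10, ["a", "b"]),
    "add": (8, ["mul", "a"]),
    "shift": (4, ["add"]),
}
example_stages = allocate_stages(example_ops, target_ns=16)
example_registers = count_stage_registers(example_ops, example_stages)
```

In this example the value of `a` is needed again in the second stage, so it costs a stage register in addition to the one holding the result of `mul`.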
3.2. Dynamic initialisation and clean-up
Normally, a pipelined design would require additional
logic to take care of initialising the pipeline stages, or to
suppress the operations in later pipeline stages until the
previous stages have filled (predication), so that they do not
operate on garbage. However, the pipelines in a coarse-grained
DRA are themselves rendered as part of the configuration
context. Provided that the configuration time is not
significantly larger than the execution time of each step,
dynamic reconfiguration can be used to render different
configurations before the main kernel loop configuration, to fill
successive stages of the pipeline, and similarly to flush the
pipeline after exiting the kernel loop. This allows the kernel
loop configuration to assume that the pipeline stages are
always full.
Prologue: New configuration contexts are created to initially
fill each successive stage of the pipeline. For n
pipeline stages, n - 1 pipeline filling contexts are created.
Loop: A single configuration context is created for the
kernel loop, which includes all pipeline stages.
Epilogue: New configuration contexts are created to flush
successive stages of the pipeline. For n pipeline stages,
n - 1 pipeline flushing contexts are created.
The core is dynamically reconfigured to first perform
pipeline initialisation, then reconfigured to execute the kernel
loop, then finally reconfigured to flush the pipeline, as
demonstrated in figure 2. This is similar to the prologue and
epilogue in software pipelining [11].

Figure 2: Control flow for a 3-stage pipelined kernel, showing which
stages are active in each context (and moment in time). Execution
flows from one context to the next, except in the kernel loop, which
loops back to itself (holding the same context) until the end condition
is satisfied.
Figure 2 shows which stages of the pipeline are active
during execution for a 3-stage pipeline. As the target
architectures may not be state free (e.g. memory access),
it is important to not allow any operation in any pipeline
stage to operate on garbage, and to preserve the execution
count. With the arrangement shown in the figure, all
pipeline stages will be executed the same number of times
irrespective of the number of iterations performed in the
kernel loop.
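The filling and flushing scheme can be checked with a short Python sketch. It assumes that the k-th prologue context activates stages 1 to k and that the k-th epilogue context activates stages k + 1 to n, which is one reading of the arrangement described above; the function names are hypothetical.

```python
def pipeline_contexts(n_stages):
    """Return (prologue, loop, epilogue) as lists of active stage numbers."""
    stages = list(range(1, n_stages + 1))
    prologue = [stages[:k] for k in range(1, n_stages)]   # n - 1 filling contexts
    epilogue = [stages[k:] for k in range(1, n_stages)]   # n - 1 flushing contexts
    return prologue, stages, epilogue


def execution_counts(n_stages, loop_iterations):
    """Count how many times each stage runs over prologue, loop and epilogue."""
    prologue, loop, epilogue = pipeline_contexts(n_stages)
    trace = prologue + [loop] * loop_iterations + epilogue
    counts = {s: 0 for s in loop}
    for context in trace:
        for s in context:
            counts[s] += 1
    return counts
```

Under these assumptions every stage runs `loop_iterations + n_stages - 1` times, so no stage ever consumes a value that an earlier stage has not yet produced.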
4. Application to streaming
The algorithm described in this paper was applied to a
real-life application: a 3-line demosaic filter [12], which
involves interpolating missing colour components from the
Bayer output of a colour filter array sensor. This is a
high-throughput task normally done on-chip (integrated
into the sensor) as part of a custom image signal processing
pipeline, used in modern digital cameras and mobile
phones. This is typically the most computationally intensive
part of a standard Image Signal Processor (ISP). The
filter was re-implemented on a reconfigurable instruction
cell-based processor [5], using the C language. Software
optimisation techniques were used to reduce the filter kernel
into a single basic block, small enough to fit onto the
target architecture in a single configuration context. The
throughput of the resulting filter is given in table 1.
The operations of the resulting kernel were then
pipelined using the algorithms described in this paper, for
several target critical path lengths (timing constraints), the
results of which are also given in table 1.
Target critical path (ns) None 40.0 30.0 20.0 19.0 16.0
Actual critical path (ns)
Throughput (MPixels/s)
Table 1: Performance of the demosaic filter kernel before pipelining,
and after pipelining. Throughput and additional register and program
memory resource requirements are shown.
Pipelining can be seen to increase the throughput, at the
expense of extra registers, and additional program memory
for the prologue and epilogue. The last column in table 1
shows that a natural throughput limit is reached, determined
in this case by the length of the feedback chains present in
the kernel data flow graph. Note that for a target of 19.0ns
in this case, the resulting critical path is greater than that
obtained for a target of 16.0ns. This represents boundary noise
in the register stage allocation algorithm, where depending
on previous choices, a different local minimum may be
found.
5. Conclusions
This paper demonstrates that structural-level pipelining
techniques can be applied via software to rapidly
reconfigurable/programmable architectures supporting
operation-chaining, where complete kernels can be mapped into a
single configuration context/cycle. This improves throughput
by reducing the critical path length of the looping kernel.
This makes such architectures ideal candidates for use as
the cores in a stream processing engine, as fewer cores are
needed to meet a particular throughput. This work concentrated
on reconfigurable instruction cell processors, which
support a high degree of operation chaining. However, the
same techniques could be applied to other quite different
architectures that support operation chaining, such as
upcoming VLIW/ULIW processors.
Furthermore, this paper introduced the idea of achieving
pipeline filling and flushing through dynamic reconfiguration,
in a manner similar to that used in software pipelining.
Pipelining was shown to increase the register requirement,
and to use more program memory to store the additional
configuration contexts (prologue and epilogue). However, the
program memory overhead can be largely avoided, since
the additional contexts are suited to temporal compression.
The potential throughput is limited by the number of
registers available for use in connecting the pipeline stages,
and by the presence of feedback loops that demand single
cycle latency (e.g. when updating the value of a register).
The algorithm was applied to a demosaic filter for a variety
of target throughput constraints, and achieved a maximum
throughput of more than three times that of the original
kernel.
References
[1] M. Butts, A. M. Jones, and P. Wasson, “A structural object programming
model, architecture, chip and tools for reconfigurable computing,”
in FCCM, 2007, pp. 55-64.
[2] B. Khailany, T. Williams, J. Lin, E. Long, M. Rygh, D. Tovey, and
W. Dally, “A programmable 512 GOPS stream processor for signal,
image, and video processing,” in Solid-State Circuits Conference,
2007, pp. 272-602.
[3] A. Major, T. Arslan, et al., “H.264 decoder implementation on a
dynamically reconfigurable instruction cell based architecture,” in
International SOC Conference, 2006, pp. 49-52.
[4] Z. Khan, T. Arslan, et al., “Implementation of a real time
programmable encoder for low density parity check code on a
reconfigurable instruction cell architecture,” in Design Automation
Conference, Asia and South Pacific, 2007, pp. 583-588.
[5] S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir, and T. Arslan,
“The reconfigurable instruction cell array,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp.
1-11, 2008.
[6] “Loosely-biased heterogeneous reconfigurable arrays,” U.S. Patent
20050257024, 2005.
[7] Y. Yi and I. Nousias, “System-level scheduling on instruction cell
based reconfigurable systems,” in Design Automation and Test in
Europe, International Conference on, 2006, pp. 381-386.
[8] J. Sanchez and A. Gonzalez, “The effectiveness of loop unrolling
for modulo scheduling in clustered VLIW architectures,” in ICPP
Parallel Processing, International Conference on, 2000, p. 555.
[9] S. Bakshi and D. Gajski, “Partitioning and pipelining for
performance-constrained hardware/software systems,” Very Large
Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 7,
no. 4, pp. 419-432, 1999.
[10] S. Silva and S. Bampi, “Area and throughput trade-offs in the design
of pipelined discrete wavelet transform architectures,” in Design
Automation and Test in Europe, International Conference on, 2005, pp.
32-37.
[11] M. Lam, “Software pipelining: an effective scheduling technique for
VLIW machines,” in ACM SIGPLAN conference on Programming
Language design and Implementation. New York, NY, USA: ACM
Press, 1988, pp. 318-328.
[12] J. Mukherjee, M. Moore, and S. Mitra, “Color demosaicing with
constrained buffering,” in Signal Processing and its Applications, Sixth
International Symposium on, vol. 1, 2001, pp. 52-55.
EXTENSIBLE SOFTWARE EMULATOR FOR RECONFIGURABLE 
INSTRUCTION CELL BASED PROCESSORS
Mark Muir1,3, Iain Lindsay1, Tughrul Arslan1,2,3, Ioannis Nousias1,3,
Sami Khawam3, Mark Milward3, Nazish Aslam2,3, Adam Major1
1 The University of Edinburgh 
2 Institute of System Level Integration 
3 Spiral Gateway 
contact: mark.muir@ed.ac.uk
ABSTRACT
This paper presents a novel high-speed behavioural simulator
(software-based emulator) for reconfigurable instruction
cell based processors. These architectures are particularly
suited to providing low-power, low-cost implementations
of applications in a streaming environment, such as image
signal processing, video playback, or base-band signal
processing. As a result, many realistic applications operate on
very large data sets, so simulation time plays a key role in
the time to market. The key aspect of this work is an
efficient serialisation algorithm (based on topological sort),
able to capture the intricacies of reconfigurable processors
that can be reconfigured very rapidly (ns). This allows
for a new generation of high-speed emulation models to be
constructed. The performance of this algorithm deployed
in an interpreter-based model is compared to other
simulation techniques. The emulator can achieve performance
around two orders of magnitude higher than current
event-driven software models, and similar to that of an
FPGA-based model. This brings the simulation times low enough
to be able to use this technology as the basis for
feedback-directed optimisation, which can significantly improve the
performance of application code.
I. INTRODUCTION
Reconfigurable instruction cell based processors [1] are
coarse grained reconfigurable computing fabrics, for use in
embedded systems. These reconfigurable architectures fill
the gap between traditional field programmable fabrics (such
as FPGAs) and microprocessors. These architectures
(introduced in section III) are an emerging technology, and
their designs are still being actively explored. Simulation is
needed to allow for rapid modification and evaluation of the
core design, avoiding the time needed to re-implement and
test the core using a hardware description language (HDL)
for an FPGA implementation, or the cost of re-fabricating
the array. Furthermore, these architectures are intended to
be provided as flexible IP blocks, where the end-user can
make significant changes to the make-up and functionality
of the core. The end-user expects a complete toolchain
to be available that is able to reflect these changes, in
order for the complete hardware/software design space to be
explored. Such a toolchain normally consists of an
optimising compiler, and a simulator [2, 3]. The application
domains that these architectures are mainly aimed at tend
to operate on large data sets, such as video playback (H.264
decoding [4]), digital signal base-band processing [5], and
image signal processing [6]. As a result, simulation time is a
crucial factor in determining the length of the architecture
definition cycle, and thus time to market.
The similarities to a microprocessor mean that software-based
simulation technologies traditionally used with
microprocessors can be adapted for these new architectures,
by taking account of the parallelism in the array.
Section II reviews traditional microprocessor emulators, and
their uses. However, reconfigurable architectures support
operation chaining (the ability to execute dependent and
independent instructions within the same clock cycle /
configuration context), which traditional emulation technology
cannot model.
Modelling parallelism on a serial machine has already been
addressed in HDL simulation, particularly in simulators intended
for dynamic reconfiguration [7]. These concepts are
borrowed to derive an event-driven model that captures the
data paths between processing elements in the array.
SystemC provides an object-orientated event-driven model with
a kernel similar to an HDL simulator, but described only
at the behavioural level in C.
This kernel-based approach of serialising in response to
run-time events imposes an overhead per configuration
context. For traditional reconfigurable and dynamically
reconfigurable hardware, the rate of reconfiguration is low, so
the overhead of updating the event-driven model on each
configuration context represents only a small fraction of
the total execution time. However, reconfigurable
instruction cell based processors are reconfigured many millions of
times per second, so this overhead introduced by the model
is large compared to the actual work done by the operations
of the modelled cells.
Therefore, moving this overhead into a pass prior to
program execution is highly desirable. This is what the
software-based emulator presented in this paper does. The emulator
moves away from the event-driven approach, and instead
mimics the same order of data flow by generating a static
schedule of operations that are performed sequentially. The
algorithm for generating this schedule, along with the
required storage queues, is described in section VI. This is a
new extension to traditional software-based emulator
technology, allowing this type of model to work with these
emerging architectures. In section VII, the run-time
performance of the proposed software-based emulator is
compared with that of a SystemC event-driven simulator, and
an FPGA implementation of the instruction cell array. A
set of representative applications are run on all three
models. Section VIII concludes.
II. EMULATION
Software-based emulation of microprocessors has been used
since at least the 1970s [8]. Emulation models the
instruction set of the target architecture by mimicking the way
that the state of the CPU, registers, and memory is
affected by each operation in the instruction set. The fetch
and execution of instructions in the emulator is performed
in the same sequential manner as in the target CPU.
Traditionally, such emulators have been custom-built to a
particular target architecture and platform [9]. Since most
CPUs are conceptually similar, these concepts can be
abstracted, making the emulator extensible. This is
commonly achieved through object-orientated design [10, 11].
Emulators are part of many modern commercial tool sets
[12]. Emulation sees the following uses:
Behavioural validation: the target architecture and
associated application development toolchain can be proven
before committing to silicon, or dedicating time to
detailed HDL simulation.
Product/Application demonstration: the ability to
add emulated hardware allows for applications to be
demonstrated in near real-time, before the hardware is
available.
Provides an easily modifiable test bench: adding
emulated hardware at the behavioural level aids in
developing peripherals, since these can be added to the
emulator, and their usefulness or interface design explored.
This makes it easy to try out new ideas (platform
exploration), without having to design them beyond the
behavioural level.
Reduces development time: algorithms can be tested
and timing information estimated in a fraction of the
time of other software-based simulation techniques
available.
Feedback-directed optimisation: information can be
extracted about a program through profiling during
execution on the emulator. This information can then be
used by a compiler [13] to make more informed decisions
when applying optimisation [14].
The generalisation of traditional emulation concepts has
also extended to the point where emulators can be
automatically generated from an abstract machine description,
along with an optimising compiler/scheduler as part
of a retargetable toolchain [2]. Machine description
languages have progressed to the extent that features of
increasingly complex architectures can be captured, including
deep pipelining of functional units, multiple instruction
issue, and the design of the memory subsystem [3]. However,
these languages are not yet able to capture the operation
chaining available in reconfigurable processors, except by
enumerating every possible configuration, which would be
impractical. Such languages could nonetheless be extended to
capture this information, and such a description could be
used to automatically generate a simulator using the
technology presented in this paper.
Developments in modern compiler technology have exhausted
much of the potential for static optimisation, and so the
trend is a shift towards feedback-directed optimisation. As
a result, an emulator for this purpose is likely to become a
significant part of standard toolchains. With this in mind,
the speed of simulation directly affects the scalability of
the toolchain with respect to target applications, which are
of ever increasing complexity. Hardware acceleration has
been commonly explored for use with emulation [15, 16,
17]. However, several of the uses listed above make the
requirement of additional hardware undesirable (if not
impractical), and so a software-only solution is the main focus
of this paper.
III. RECONFIGURABLE INSTRUCTION CELL 
BASED PROCESSORS
Reconfigurable instruction cell based processors [1] are
coarse grained reconfigurable computing fabrics, consisting
of a heterogeneous array of programmable cells on a
programmable interconnect network. An example can be seen
in fig. 1. The cells perform operations similar to those
found in a conventional arithmetic logic unit (ALU), and
can be combined through the reconfigurable interconnect
to perform more complex instructions in a single
configuration cycle (context). A configuration context persists for
a time sufficient for the sequence of connected cells with
the longest propagation delay (the critical path) to
complete, and then the next configuration context is loaded.
The next configuration context can be chosen arbitrarily, by
programming the jump cell. This way, arbitrary program
control flow is possible. Essentially, the architecture can
look like a multiple-issue microprocessor, with a very large
instruction set. A reconfigurable fabric exploits parallelism
and thus can achieve higher performance than other
reprogrammable technologies, like microprocessors or DSPs. It
essentially offers the programmability of a
microprocessor-based solution, but with power consumption and
performance approaching that of an ASIC.
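As a rough illustration of how long a configuration context must persist, the sketch below computes the longest propagation delay through a set of chained cells. The cell names and delay values are invented for the example, and `context_duration` is a hypothetical helper rather than part of any tool described here.

```python
from functools import lru_cache


def context_duration(delay_ns, inputs):
    """Longest accumulated delay over all chains of connected cells.

    delay_ns maps cell -> propagation delay; inputs maps cell -> feeding cells.
    The connections are assumed to form an acyclic graph.
    """
    @lru_cache(maxsize=None)
    def finish(cell):
        feeders = inputs.get(cell, ())
        return delay_ns[cell] + max((finish(p) for p in feeders), default=0.0)

    return max(finish(cell) for cell in delay_ns)


# Hypothetical context: load feeds mul and shift; mul and load feed add.
example_delays = {"load": 3.0, "mul": 4.0, "add": 2.0, "shift": 1.0}
example_inputs = {"mul": ("load",), "add": ("mul", "load"), "shift": ("load",)}
example_duration = context_duration(example_delays, example_inputs)
```

Here the chain load, mul, add is the critical path, so the context would need to persist for the sum of those three delays.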
IV. THE MODELLED SYSTEM
Figure 1: Modelled system: reconfigurable core (simplified),
memory, and example peripherals.
An example system that can be modelled with the emulator
is shown in fig. 1, and consists of the instruction cell
array core, with separate program and data memories, and
some simple peripherals. In this example, data memory is
arranged in multiple banks, accessed through special cells
in the array. Since more than one memory access cell is
provided in the array, multiple accesses can be performed by
the core in one configuration context. If all such accesses
are to different banks, then these accesses are performed
in parallel. Otherwise, conflicting requests are performed
sequentially, which incurs a dynamic delay. The emulator
can be used to characterise memory access patterns and use
the results to direct scheduling [18] and linking of the
program to optimise access to data memory (feedback-directed
optimisation, mentioned in section II).
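The bank conflict behaviour can be sketched as follows. The mapping from address to bank (word-interleaved across four banks of 4-byte words) is an assumption made for illustration, as is the function name `access_slots`; the text does not specify how addresses are distributed over banks.

```python
from collections import Counter


def access_slots(addresses, n_banks=4, word_bytes=4):
    """Number of sequential access slots needed for one configuration context.

    Accesses to different banks share a slot; accesses that hit the same
    bank are serialised. The interleaved bank mapping is an assumption.
    """
    per_bank = Counter((a // word_bytes) % n_banks for a in addresses)
    return max(per_bank.values(), default=0)
```

A profile of such slot counts per context is the kind of information that could be fed back to a scheduler or linker to rearrange data across banks.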
V. EXTENSIBILITY
Software emulations of memory-mapped peripherals such
as a DMA controller, video frame buffer, or audio buffer
can easily be added. These communicate with the core
just like they would in real life: either through the
memory interface, or through special-purpose cells in the array.
New instruction cells can be added to the core simply by
defining a new object. New memory-mapped peripheral
modules can be added by defining a new object for the
peripheral, which responds to events from the memory
interface through known method calls, see fig. 2, in response to
activity on the appropriate addresses. Peripherals can have
a kernel that operates on a separate thread, if they are to
perform operations that are independent to the core. The
video frame buffer emulation is an example of this: it
performs colour space conversions and renders frames largely
on its own time-base. More complex peripherals, such as
a DMA controller, can be added that connect to both the
memory interface and to the array via special control cells.
These could be implemented by creating a new object for
the special cell, and allowing the cell object to communi­

Figure 2: Pseudo-code for memory interface.
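The idea of a peripheral object that responds to memory interface events through known method calls might look like the following Python sketch. The class and method names (`MemoryInterface`, `map_peripheral`, `on_read`, `on_write`, `FrameBuffer`) are hypothetical and are not taken from the emulator.

```python
class Peripheral:
    """Base class: a peripheral reacts to accesses within its address window."""

    def on_write(self, offset, value):
        pass

    def on_read(self, offset):
        return 0


class FrameBuffer(Peripheral):
    """Toy frame buffer that simply stores pixel words."""

    def __init__(self, size):
        self.pixels = [0] * size

    def on_write(self, offset, value):
        self.pixels[offset] = value

    def on_read(self, offset):
        return self.pixels[offset]


class MemoryInterface:
    """Data memory as a plain array, with address windows routed to peripherals."""

    def __init__(self, size):
        self.ram = [0] * size
        self.mapped = []  # (base address, length, peripheral)

    def map_peripheral(self, base, length, peripheral):
        self.mapped.append((base, length, peripheral))

    def _find(self, address):
        for base, length, peripheral in self.mapped:
            if base <= address < base + length:
                return peripheral, address - base
        return None, address

    def write(self, address, value):
        peripheral, offset = self._find(address)
        if peripheral is None:
            self.ram[address] = value
        else:
            peripheral.on_write(offset, value)

    def read(self, address):
        peripheral, offset = self._find(address)
        return self.ram[address] if peripheral is None else peripheral.on_read(offset)
```

Adding a new peripheral in this sketch only requires a new subclass and one `map_peripheral` call, which mirrors the extensibility argument made in this section.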
VI. EMULATOR TECHNOLOGY
The emulator is an object-orientated program written in
C++, and is modular in design. Each hardware component
mentioned in section IV is represented by a class (object),
and they communicate with each other via method
calls. The model of the core is simply a set of instruction
cell models, each of which contains the state information
that the real cell would maintain, and a set of cell ‘actions’
which capture the behaviour of that cell. The cell actions
are implemented as C++ ‘methods’. The operation of a
given cell is represented by one or more of the following
cell actions:
Evaluate: Assign the output value of the cell and/or
modify the internal state of the cell according to the
configuration word.
Operate: Assign the output value of the cell according
to the configuration word and the values read from its
input(s).
Update: Modify the internal state of the cell according
to the configuration word and values read from the
input(s).
A serialised configuration context consists of the ‘evaluate’
actions (scheduled in any order), followed by the ‘operate’
actions (specifically ordered by the serialisation algorithm
described in section A), followed by the ‘update’ actions (in
any order). Cells that perform only simple combinatorial
operations (which calculate an output value based on the
values of their inputs) implement only the ‘operate’ action.
The code sample in fig. 3 demonstrates a simplified version
of an ‘ADD’ cell, which is an example of a combinatorial
cell.
object Add_cell extends Instruction_cell
{
properties:
- output // Storage for cell’s output.
constructor:

case ADD_ADD_SI: // Single integer.
output = in1 + in2
case ADD_SUB_V2HI: // Vector mode.

Figure 3: Simplified ADD cell class implementation pseudo-code.
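A runnable Python analogue of a combinatorial cell with an ‘operate’ action is sketched below. It is not the emulator's C++ class: the 32-bit wrap-around for `ADD_ADD_SI` and the interpretation of `ADD_SUB_V2HI` as two independent 16-bit lane subtractions are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF
MASK16 = 0xFFFF


class InstructionCell:
    """Base cell: holds the output value and provides empty default actions."""

    def __init__(self):
        self.output = 0  # Storage for the cell's output.

    def evaluate(self, config):
        pass

    def operate(self, config, inputs):
        pass

    def update(self, config, inputs):
        pass


class AddCell(InstructionCell):
    """Combinatorial cell: only the 'operate' action is implemented."""

    def operate(self, config, inputs):
        in1, in2 = inputs
        if config == "ADD_ADD_SI":       # Single integer (assumed 32-bit wrap).
            self.output = (in1 + in2) & MASK32
        elif config == "ADD_SUB_V2HI":   # Vector mode (assumed 2 x 16-bit lanes).
            low = ((in1 & MASK16) - (in2 & MASK16)) & MASK16
            high = ((in1 >> 16) - (in2 >> 16)) & MASK16
            self.output = (high << 16) | low
        else:
            raise ValueError("unsupported configuration: %r" % (config,))
```

Because the cell keeps no state beyond its output, the emulator would only ever need to schedule its ‘operate’ action.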
The emulator parses the netlist describing the target
program, then serialises the operations of each configuration
context into a sequence of equivalent cell actions. These
serialised operations are stored in an internal data model.
The serialisation process is described in section A.
Execution of the program then proceeds: these sequences of cell
actions for each configuration context encountered are
executed in a large state machine by calling the appropriate
virtual function, as shown in fig. 4.
The model of each cell contains a variable that holds the
value for the cell’s output port. This can then be referenced
(read) by the actions of cells that depend on that value (the
‘input’ vector passed into the operate() method). Note that
the program counter can also be updated via the cell actions
(for the jump cell), and this determines which configuration
context will follow. A configuration context is the smallest
// Execute steps until end condition is detected.
do
{
step index = jumpcell program counter value 
this step = program[step index] 
for each cell action in this step 
{















while jump cell hasn’t detected end
Figure 4: Core execution loop pseudo-code.
unit that can be used as the target for jumps. The data
memory is modelled as a simple array wrapped by an object
that provides an interface to read and write to the memory,
as described in section V.
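The execution loop can be sketched in Python as below. This is an illustration under stated assumptions, not the emulator's code: the cell classes, the tuple layout of a cell action, and in particular the behaviour of the `Jump` cell (loop back to a target context while its input is below a limit, otherwise signal the end) are hypothetical.

```python
class Cell:
    def __init__(self):
        self.output = 0

    def evaluate(self, config):
        pass

    def operate(self, config, inputs):
        pass

    def update(self, config, inputs):
        pass


class Const(Cell):
    def evaluate(self, config):
        self.output = config


class Add(Cell):
    def operate(self, config, inputs):
        self.output = inputs[0] + inputs[1]


class Register(Cell):
    def __init__(self):
        super().__init__()
        self.state = 0

    def evaluate(self, config):
        self.output = self.state   # Emit the stored value.

    def update(self, config, inputs):
        self.state = inputs[0]     # Latch the new value for the next context.


class Jump(Cell):
    """Hypothetical jump cell: config is (limit, target context index)."""

    def __init__(self):
        super().__init__()
        self.pc = 0
        self.ended = False

    def update(self, config, inputs):
        limit, target = config
        if inputs[0] < limit:
            self.pc = target
        else:
            self.ended = True


def run(program, jump):
    """Execute serialised contexts until the jump cell detects the end."""
    steps = 0
    while not jump.ended:
        for cell, action, config, sources in program[jump.pc]:
            inputs = [source.output for source in sources]
            if action == "evaluate":
                cell.evaluate(config)
            elif action == "operate":
                cell.operate(config, inputs)
            else:
                cell.update(config, inputs)
        steps += 1
    return steps
```

A single context that increments an accumulator register and loops until the sum reaches 5 would be described as a list of evaluate, operate and update actions, in that order, and passed to `run` together with the jump cell.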
A. SERIALISATION ALGORITHM
The serialisation algorithm is used to create the internal
representation which drives the execution state machine.
The key requirement of this algorithm is to ensure that
the result of executing the sequence of cell actions in the
execution model exactly matches the result of the original
data flow graph for that configuration context. This
simply requires that a cell’s action (for calculating its
output value) is scheduled before those of any dependent cells
(successors). The serialisation algorithm requires extension
to deal with situations where cells maintain internal state
from one configuration context to the next. To explain this,
we first give an example involving only combinatorial cells,
then a second example showing the extension required to
avoid apparent connection loops arising from internal state.
Figure 5: Example configuration contexts: (a) involving only
combinatorial operations, (b) including a ‘connection loop’;
this case is valid since the loop involves a register, which is a
‘terminal cell’.
The data flow graph example for a configuration context
involving only combinatorial cells is given in fig. 5(a). The
operation of any purely combinatorial cell needs only an
‘operate’ action to be defined. The constant cells supply
the operands for a set of operations, and the final result is
written to storage. A human might choose a sequence such
as that shown by the numbers in fig. 5(a). The algorithm
employed by the emulator constructs the connection
hierarchy between the active cells as a directed graph. Once
the hierarchy is complete, the topological sort operation
from graph theory is applied to the graph. The topological
sort results in nodes being ordered in descending order
of depth in the connection hierarchy. The ordered result is
used to schedule the ‘operate’ actions of each cell. Within
a given depth, the cell actions could be scheduled in any
order, without affecting the overall result. The direction of
the arrows in fig. 5 indicates the direction of data flow, and
defines the terminology of predecessor feeding data to a
successor, i.e. one of the successor’s input ports is connected
to the output port of the predecessor. In the completed
hierarchy, a predecessor lies in some level lower than that of
any of its successors.
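For the purely combinatorial case, the ordering step can be illustrated with a standard topological sort (Kahn's algorithm). The graph below is a made-up example rather than the one in fig. 5(a), and `operate_order` is a hypothetical name.

```python
from collections import deque


def operate_order(successors):
    """Order cells so that every predecessor precedes its successors.

    successors maps each cell to the list of cells that read its output.
    """
    indegree = {cell: 0 for cell in successors}
    for outs in successors.values():
        for s in outs:
            indegree[s] = indegree.get(s, 0) + 1

    ready = deque(cell for cell, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)
        for s in successors.get(cell, ()):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    if len(order) != len(indegree):
        raise ValueError("connection loop: topological sort is impossible")
    return order


# Hypothetical context: (const0 + const1) * const2, written to storage.
example_graph = {
    "const0": ["add0"],
    "const1": ["add0"],
    "const2": ["mul0"],
    "add0": ["mul0"],
    "mul0": ["store0"],
    "store0": [],
}
example_order = operate_order(example_graph)
```

Cells that become ready at the same time are emitted in insertion order here, but any order among them would give the same result, as noted above.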
Things are a bit more complicated than this, however,
because some cells maintain internal state information.
Taking registers as an example, the output of the cell does not
depend on the input in the current configuration context;
instead it depends on the internal state of the register cell
(which in turn usually depends on the input to the
register from a previous configuration context). This means
that it is valid for a register to appear in a ‘connection
loop’, where the output of the register is used in some
sequence of operations, the result of which is stored back in
the same register. This results in a cyclic graph, making
a topological sort impossible. Essentially, the register cell
can be thought of as two cells: one emitting the current
value, and one receiving the new value. However, this is
not a clean approach.
Alternatively, we can introduce the concept of ‘terminal
cells’, i.e. cells where the inputs do not affect the outputs
during the execution of that configuration context. Now,
connection loops are valid if one of the cells in the loop is
terminal. Terminal cells provide an ‘evaluate’ action, in
addition to an ‘operate’ action. Calculating the output value
of a terminal cell can always be done before anything else
during the execution of a configuration context (since the
value does not depend on the result of any other cell during
that configuration context), and writing to the input(s) of a
terminal cell can always be done after anything else during
the execution of a configuration context (since the written
value does not affect any other cells during that
configuration context). Furthermore, some cells need to have their
state modified upon each configuration context transition
(reconfiguration). This is done by providing an ‘update’
action, that is performed once the rest of the actions have
been executed. So, the algorithm is extended by scheduling
all ‘evaluate’ actions first, followed by the sequence of
‘operate’ actions obtained from the topological sort, and
finally all ‘update’ actions are scheduled. Fig. 5(b) shows an
example, to which the algorithm would assign the following
sequence of cell actions:
const[0](evaluate), const[1](evaluate), reg[0](evaluate),
add[0](oper.), div[0](oper.), reg[0](oper.), reg[10](oper.),
reg[0](update), reg[10](update)
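The extended algorithm can be sketched as follows. The connection graph used in the example is a guess chosen so that the output resembles the sequence listed above; it is not taken from fig. 5(b). The choice to emit ‘evaluate’ only for terminal cells whose output is read, and the explicit `stateful` set naming the cells that provide an ‘update’ action, are likewise assumptions.

```python
def serialise(cells, edges, terminal, stateful):
    """Return a list of (cell, action) pairs for one configuration context.

    cells: cell names in a fixed order; edges: (source, destination) pairs;
    terminal: cells whose outputs do not depend on this context's inputs;
    stateful: cells that provide an 'update' action.
    """
    used_outputs = {src for src, _ in edges}
    has_inputs = {dst for _, dst in edges}

    # 1. 'evaluate' actions: terminal cells whose output is read (any order).
    schedule = [(c, "evaluate") for c in cells
                if c in terminal and c in used_outputs]

    # 2. 'operate' actions in topological order. Edges leaving a terminal
    #    cell impose no ordering, which breaks loops through such cells.
    deps = {c: set() for c in cells}
    for src, dst in edges:
        if src not in terminal:
            deps[dst].add(src)
    pending = [c for c in cells if c not in terminal or c in has_inputs]
    placed = set()
    while pending:
        ready = [c for c in pending if deps[c] <= placed]
        if not ready:
            raise ValueError("connection loop without a terminal cell")
        for c in ready:
            schedule.append((c, "operate"))
            placed.add(c)
        pending = [c for c in pending if c not in placed]

    # 3. 'update' actions (any order).
    schedule += [(c, "update") for c in cells if c in stateful]
    return schedule


# Hypothetical graph: reg0 = (const0 + const1) / reg0, result also sent to reg10.
example_cells = ["const0", "const1", "reg0", "add0", "div0", "reg10"]
example_edges = [("const0", "add0"), ("const1", "add0"), ("add0", "div0"),
                 ("reg0", "div0"), ("div0", "reg0"), ("div0", "reg10")]
example_schedule = serialise(example_cells, example_edges,
                             terminal={"const0", "const1", "reg0", "reg10"},
                             stateful={"reg0", "reg10"})
```

The loop through `reg0` does not prevent the sort, because the edge from `reg0` to `div0` is ignored once `reg0` is known to be terminal.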
Registers are only a  simple exam ple of th is problem. More 
complex exam ples include interfaces to  stream ing memo­
ries, and  cells th a t  are internally  pipelined such th a t their 
o u tp u t is delayed by (several) iterations. It has so far 
proven possible to  m ap all supported  cells to th is mech­
anism, and th is approach is quite effective in minimising 
the num ber of operations th a t  need to  be performed for 
each configuration context.
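The three-phase ordering described above can be illustrated with a minimal Python sketch. This is not the emulator's implementation: the function name, the data layout and the connection graph in the example are assumptions made for illustration (the graph of fig. 5(b) is not reproduced here), and the rule that a cell receives an ‘operate’ action only if it has an input, and an ‘evaluate’ action only if it is terminal and drives another cell, is likewise an assumption chosen to match the action list quoted in the text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def serialise_context(cells, connections, terminal, stateful):
    """Return an ordered list of (cell, action) pairs for one context.

    cells:       list of cell names
    connections: list of (source, sink) pairs
    terminal:    cells whose inputs do not affect their outputs in this context
    stateful:    cells whose state is modified on a context transition
    """
    has_input = {sink for _, sink in connections}
    has_output = {source for source, _ in connections}

    # Build the predecessor map for the sort. Edges leaving a terminal cell
    # are dropped, because its output is produced by 'evaluate' beforehand.
    preds = {cell: set() for cell in cells}
    for source, sink in connections:
        if source not in terminal:
            preds[sink].add(source)

    # Raises graphlib.CycleError if a loop contains no terminal cell.
    order = list(TopologicalSorter(preds).static_order())

    actions = [(c, "evaluate") for c in cells if c in terminal and c in has_output]
    actions += [(c, "operate") for c in order if c in has_input]
    actions += [(c, "update") for c in cells if c in stateful]
    return actions


# Hypothetical connection graph containing a register feedback loop.
CELLS = ["const[0]", "const[1]", "reg[0]", "reg[10]", "add[0]", "div[0]"]
CONNECTIONS = [
    ("const[0]", "add[0]"), ("reg[0]", "add[0]"),
    ("add[0]", "div[0]"), ("const[1]", "div[0]"),
    ("div[0]", "reg[0]"), ("div[0]", "reg[10]"),
]
TERMINAL = {"const[0]", "const[1]", "reg[0]", "reg[10]"}
STATEFUL = {"reg[0]", "reg[10]"}

if __name__ == "__main__":
    for cell, action in serialise_context(CELLS, CONNECTIONS, TERMINAL, STATEFUL):
        print(f"{cell}({action})")
```

Because the sort is performed once per configuration context before execution, the run-time cost per context reduces to walking the resulting list.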
VII. PERFORMANCE
The performance of the emulator was compared against a
SystemC transaction-level model of the same instruction
cell-based processor, and an FPGA implementation of the
same array (i.e. a dynamic reconfigurable fabric on a static
reconfigurable fabric). A quad 2.2GHz AMD Opteron PC
was used as the host machine for the emulator and SystemC
model. The FPGA used was the Virtex-4 LX160. Note
that an FPGA implementation performs the same role as an
HDL simulation of the processor architecture, and is used
instead of an HDL simulation since it achieves much higher
run-time performance, and so is much more suitable for the
task of near real-time application demonstration. The execution
speed was used as the measure of performance. The
reconfigurable array is intended to have a system clock of
500MHz. The maximum achievable clock on the FPGA implementation
of the target processor is 12MHz (determined
by the critical path of the synthesised instruction cell array
rendered on the FPGA, which is the same irrespective
of the target application). The ratio of these gives the performance
value for the FPGA. For the other methods, the
execution time was accurately measured and averaged over
several runs. The averaging is necessary for user-space programs,
in order to reduce random error introduced by pre-emptive
context switches on the host. Execution speed is
the time that the target application should have run for on
the reconfigurable array, divided by the average run time
on the model. The following algorithms/applications were
used:
•  Discrete Cosine Transform (DCT) (for MPEG4/H.264
video).
•  Finite Impulse Response (FIR) digital filter.
•  Dhrystone (integer CPU performance benchmark).
•  MP3 (MPEG-1 layer 3) audio decoder (libmad).
•  H.264 video decoder (ffmpeg).
Table 1 shows that the performance of the emulator described
in this paper is good compared to the other simulation
methods described. The real silicon (native) is between
21 and 101 times faster than the emulator, and the FPGA
model is close in speed to the emulator. Since the FPGA
is a model of the real silicon, it is a constant fraction of the
speed of the real silicon. Both software models vary in execution
speed (compared to the real silicon), depending on
the application.





Benchmark   Emulator   SystemC    FPGA   Native
FIR         1.000      3.40e-3    0.52   21
DCT         1.000      5.52e-3    1.47   61
H.264       1.000      9.44e-3    1.43   59
MP3         1.000      12.00e-3   2.43   101
Dhrystone   1.000      76.00e-3   0.83   34
Table 1: Execution speed for various standard benchmarks,
normalised to the speed of the emulator.
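As a quick consistency check on these figures, the short Python sketch below compares the FPGA-to-native ratio in each row of Table 1 against the ratio of the two clock frequencies quoted in the text (12MHz and 500MHz). It assumes that the last two numeric columns of the table are the FPGA model and the native silicon respectively, and that the Dhrystone FPGA entry reads 0.83; both are readings of the table rather than statements from the text.

```python
# Clock figures quoted in the text.
NATIVE_CLOCK_HZ = 500e6  # intended system clock of the reconfigurable array
FPGA_CLOCK_HZ = 12e6     # maximum clock of the FPGA implementation

# (FPGA, native) execution speeds relative to the emulator, read from Table 1.
# Assumption: column order and the Dhrystone FPGA value (0.83) as noted above.
TABLE_1 = {
    "FIR": (0.52, 21),
    "DCT": (1.47, 61),
    "H.264": (1.43, 59),
    "MP3": (2.43, 101),
    "Dhrystone": (0.83, 34),
}


def fpga_fraction_of_native():
    """Speed of the FPGA model as a fraction of the real silicon."""
    return FPGA_CLOCK_HZ / NATIVE_CLOCK_HZ


def table_ratios():
    """FPGA/native ratio for each benchmark row."""
    return {name: fpga / native for name, (fpga, native) in TABLE_1.items()}


if __name__ == "__main__":
    print(f"clock ratio: {fpga_fraction_of_native():.4f}")
    for name, ratio in table_ratios().items():
        print(f"{name}: FPGA/native = {ratio:.4f}")
```

Under these assumptions every row gives a ratio within 0.001 of the clock ratio, which agrees with the statement that the FPGA model runs at a constant fraction of the speed of the real silicon.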
The relative performance of the emulator and SystemC
model can also be seen to depend on the application. Since
these two models use very similar cell implementations,
written in C, this highlights the differences in the overheads
incurred by the method of simulation. In addition
to performing the actual work of the cells, the SystemC
kernel incurs an overhead for each event generated by the
active cells, and a further overhead at the end of each configuration
context. The emulator, on the other hand, only
incurs the latter overhead, since everything except for the
path of program execution is serialised prior to execution.
The Dhrystone example consists of many short basic blocks,
which results in very low core utilisation. This represents
the extreme of frequent configuration context changes with
few cell operations in between. The FIR example represents
the opposite extreme, where the program consists largely
of one basic block, which results in very high core utilisation,
and much core activity between configuration context
switches. The results in table 1 show that the emulator is
best advantaged when core utilisation is high, which supports
this argument.
Program         Total ops.   Critical path    Relative
                per iter.                     speed
Parallel        26           16ns (5 ops.)    1246x
Combinatorial   26           40ns (9 ops.)    1377x
Sequential      11           16ns (5 ops.)    2258x
Table 2: Complexity and relative execution speed (emulator
vs. SystemC model) for some simple test programs written to
investigate the reason for the application-dependent relative
execution speed.
To examine this further, some small test programs were
written, each consisting of a single loop mapping to a single
configuration context. In each case, the loop body consists
of a relatively simple sequence of arithmetic operations to
apply to each member of a data set. The programs differ
in when the operations for a given member of the data
set are executed. The programs are shown in table 2. To
test the effect of the number of events generated per iteration,
one program (‘Parallel’) performs the operations of
four members of the data set in parallel, whilst another
program (‘Combinatorial’) also operates on four members
of the data set per iteration, but a data dependency exists
preventing them from running entirely in parallel (however
they still overlap to a certain extent). The number and
type of operations performed per iteration in both of these
programs is the same; however the latter (‘Combinatorial’)
case has a longer critical path. The relative performance of
the emulator and SystemC model is similar for both programs.
The execution time of the emulator should depend
only on the operations performed, and not the order. For
the SystemC model, the longer critical path (and number
of operations chained together) causes more flutter as the
combinatorial paths stabilise, resulting in more transition
events being generated. However, the execution time for
each event is very small compared to the time taken to
schedule the events, and the results in fact show a slight
relative gain. This indicates that the run-time scheduling
is easier when the timing of the events is more sequential.
To test the effect of the number of operations per iteration,
another program was written (‘Sequential’), this time
with only one member of the data set operated on per iteration
of the kernel. This requires that four times as many
iterations are performed. A significant increase in the relative
speed of the emulator can be seen compared to the
previous test programs. This therefore indicates that the
SystemC model incurs a disproportionately large overhead
per iteration, which supports the earlier observation with
the standard benchmarks.
VIII. CONCLUSIONS
Existing software-based methods of simulation for reconfigurable
computing architectures are event-driven, and incur
a sizeable time penalty for every configuration context. For
instruction cell architectures, which have to be reconfigured
many millions of times per second, this overhead eclipses
the actual work done by the modelled processing units.
A novel approach was suggested to reduce this overhead,
by moving the resolution of the dependencies in the data
paths into a pre-processing stage, prior to execution. When
applied to an example processor, the results (section VII)
show that the execution speed achieved using this new approach
is around two orders of magnitude higher than an
equivalent SystemC model, and largely matches the speed
of an FPGA model of the target reconfigurable instruction
cell array.
This level of performance makes the proposed emulator
suitable for use in feedback-directed optimisation, and thus
could be an important part of future toolchains. Furthermore,
the emulator is highly adaptable to different types
of reconfigurable processors with different functionality but
similar control concepts, making it a good candidate for use
in retargetable toolchains for hardware/software co-design.
In addition, the serialisation algorithm can be applied directly
to translation, allowing even faster emulations to be
performed. This would be achieved by using the output
of the serialisation algorithm to generate static call lists for
each configuration context, which are then fed into an optimising
linker (such as LLVM [19]) to generate an optimised,
native binary, eliminating the overhead of interpretation.
IX. REFERENCES
[1] S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir, and
T. Arslan, "The reconfigurable instruction cell array," IEEE
Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 16, no. 1, pp. 1-11, 2008.
[2] A. Hoffman, T. Kogel, and H. Meyr, "A framework for fast
hardware-software co-simulation," in Design Automation
and Test in Europe, international conference on, 2001,
pp. 760-764.
[3] A. Halambi, P. Grun, et al., "EXPRESSION: A language for
architecture exploration through compiler/simulator
retargetability," in Design Automation and Test in
Europe, international conference on, 1999, pp. 485-490.
[4] A. Major, T. Arslan, et al., "H.264 decoder implementation on
a dynamically reconfigurable instruction cell based
architecture," in International SOC Conference, 2006,
pp. 49-52.
[5] Z. Khan, T. Arslan, et al., "Implementation of a real time
programmable encoder for low density parity check code on a
reconfigurable instruction cell architecture," in Design
Automation Conference, Asia and South Pacific,
2007, pp. 583-588.
[6] M. Muir, T. Arslan, and I. Lindsay, "Automated dynamic
throughput-constrained structural-level pipelining in
streaming applications," in Design Automation and Test
in Europe, international conference on, 2008, p. TBA.
[7] P. Bellows and B. Hutchings, "JHDL - an HDL for
reconfigurable systems," in FPGAs for Custom
Computing Machines, IEEE symposium on, 1998,
pp. 175-184.
[8] L. Robertson, "Anecdotes," Annals of the History of
Computing, IEEE, vol. 27, no. 2, pp. 82-84, 2005.
[9] H. Diab and I. Demashkieh, "A reconfigurable microprocessor
teaching tool," in Science, Measurement and Technology,
IEE Proceedings, 1990, pp. 287-292.
[10] C. Cooper and P. Werstein, "The use of Java to develop a
microprocessor emulator," in Software Engineering:
Education and Practice, 1998, pp. 272-277.
[11] W. Zaatar and G. E. Nasr, "An implementation scheme for a
microprocessor emulator," in ICECS Electronics, Circuits
and Systems, 7th international conference on, 2000,
pp. 169-172.
[12] S. Bush, "ARM offers real-time prototyping capability,"
Electronics Weekly, no. 39339, July 2006.
[13] E. R. Altman, S. Sathaye, and M. Gschwind,
"Execution-based scheduling for VLIW architectures," in
Euro-Par’99 - Parallel Processing, international
conference on, 1999, pp. 1269-1275.
[14] R. Cohn and P. G. Lowney, "Feedback directed optimisation
in Compaq's compilation tools for Alpha," in Proceedings of
the 2nd ACM Workshop on Feedback-directed
optimisation, 1999.
[15] M. Gschwind, V. Salapura, and D. Maurer, "FPGA prototyping
of a RISC processor core for embedded applications," IEEE
Transactions on VLSI Systems, pp. 241-250, 2001.
[16] Y. Nakamura and K. Hosokawa, "Fast FPGA emulation based
simulation environment for custom processors," IEICE
transactions on fundamentals of electronic
communications in computer science, vol. E89-A,
pp. 3464-3470, 2006.
[17] S. Fink and E. Sanchez, "Development and prototyping for an
8-bit multitask micropower processor," in Proceedings of the
6th IEEE International Workshop on Rapid System
Prototyping, 1995, pp. 75-78.
[18] Y. Yi and I. Nousias, "System-level scheduling on instruction
cell based reconfigurable systems," in Design Automation
and Test in Europe, international conference on, 2006,
pp. 381-386.
[19] C. Lattner and V. Adve, "The LLVM compiler framework and
infrastructure tutorial," in Mini Workshop on Compiler
Research Infrastructures (LCPC’04), 2004.
Automatic Dynamic Structural-level Pipelining in 
Reconfigurable Processors
Mark Muir (1), Nazish Aslam (2),
Ioannis Nousias (1), Adam Major (1),
Tughrul Arslan (1,2), Iain Lindsay (1)
(1) The University of Edinburgh, Mayfield Road, Edinburgh, United Kingdom, EH3 9JL
(2) Institute for System Level Integration, Alba Centre, Livingston, United Kingdom, EH54 7EG
mark.muir@ed.ac.uk
ABSTRACT
This paper describes a technique for automated dynamic structural-level
pipelining of programs targeting dynamically reconfigurable
processors with very short reconfiguration times. These architectures
are particularly suited to streaming applications, whose primary
market is low-cost, high-volume consumer products such
as the image signal processor for digital cameras in modern mobile
phones. These reconfigurable processors open up the ability
for vendors to differentiate their products by providing their own
algorithms. To minimise area (and thus cost), it is important that
vendors have the ability to tailor the resources of the core to their
needs. Therefore, the process of application development is that of
hardware/software co-design. As part of a high-level software tool
chain, we present an optimisation technique that can be added to the
compiler to significantly improve the throughput of applications, by
pipelining tight loops (kernels) which perform the majority of the
work in streaming applications. This allows more complex algorithms
to be deployed whilst still meeting the available timing budget.
The timing constraint is determined automatically, in a manner
which maximises throughput within a given resource budget.
1. INTRODUCTION
The choice of platform for many modern digital signal processing
tasks in embedded systems is often limited to application-specific
integrated circuits (ASICs), since off-the-shelf programmable architectures
such as DSPs and microprocessors cannot meet the throughput
requirements, whereas reconfigurable hardware such as
field-programmable gate arrays (FPGAs) require too much area
and power. However, for applications that demand an element
of reprogrammability, streaming processors (such as those offered
by Ambric [1] and SPI [2]) are becoming an increasingly attractive
solution, which improve on throughput by providing multiple
processing elements/cores with an interconnect structure suited to
streaming. However, these processing elements (usually based on
regular DSP designs) often equate to significant silicon area. Alternatively,
coarse-grained dynamically reconfigurable architectures
(DRAs) offer a high degree of parallelism, sufficient to achieve
high throughput [3][4]. Thus fewer cores are required for a given
application, leading to a much lower area overhead. These coarse-grained
architectures, if given the ability to control their own reconfiguration,
can be reconfigured very rapidly (e.g. millions of times
per second), in order to achieve control flow similar to a regular microprocessor.
This paper focuses on maximising the performance
of programs running on a single core. However, the techniques can
be directly applied to programs running on additional cores in a
complete streaming application.
Coarse-grained DRAs, such as instruction cell based processors
[5][6], provide a high degree of instruction chaining inside the core,
by allowing arbitrary connections to be made between the various
functional units via a configurable routing network. This allows
quite complex data paths to be rendered onto the fabric and executed
in a single configuration. This makes these architectures
particularly suitable to stream processing, as fewer fetches from
program memory are required. Performance is optimised by attempting
to match the size of each kernel (inner loops where most
of the execution time is spent) to the available resources, allowing
them to fit into a single configuration context. This allows the configuration
to persist for many clock cycles, operating on new data
on each cycle. This increases throughput, since no time is spent
having to reconfigure the core between successive iterations. It
also decreases power consumption, as the configuration only needs
to be fetched from program memory (or cache) once (upon first
entering the kernel) rather than on every iteration. However, the
resulting data paths can often have a long critical path, leading to
poor temporal utilisation of the functional units, since they have to
wait until all functional units have completed before operating on
the next batch of data, which limits the throughput.
Pipelining provides a way of starting to operate on a new batch
of data before an old one has completed. Thus, this allows the
functional units of multiple stages of the kernel to be active concurrently,
each operating on a different batch of data. Others have
devised loop pipelining techniques for reconfigurable architectures
[7, 8, 9], where successive iterations of the loop are replicated in
hardware, and offset from each other to deal with any data dependencies
between the iterations. These are most suitable for
large reconfigurable architectures with much longer reconfiguration
times, where there are sufficient resources for the entire loop
body to be replicated many times. This paper elaborates on and
extends work in a previous paper where structural-level pipelining
techniques were shown to be applicable via software to rapidly
reconfigurable/programmable architectures supporting operation-chaining.
The technique allows complete kernels that were mapped
to a single configuration context, to have their critical path length
decreased by the addition of pipeline stage registers. Pipeline filling
and flushing are achieved through dynamic reconfiguration.
The contribution in this work is the ability to automate the tasks of 
identifying configuration contexts which could benefit from pipelin­
ing, and choice of critical path constraint. In particular, in order 
to reduce power in the target architectures, the master clock fre­
quency is kept as low as possible. Configuration contexts are al­
lowed to persist for multiple clock cycles, until their critical path 
has completed. Pipelining reduces the critical path, so as a result, 
the quantisation introduced by the master clock frequency affects 
pipelined contexts more. Therefore, it is important to minimise the 
wasted time between the critical path stabilising and the next mas­
ter clock cycle. The automatic pipelining algorithm demonstrated 
here attempts to do this.
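The quantisation effect can be made concrete with a small Python sketch. The function name and all numerical values are hypothetical and chosen only for illustration; the assumption is that a context persists for a whole number of master clock cycles, namely the smallest number that covers its critical path.

```python
def context_timing(critical_path_ns, clock_period_ns):
    """Return (cycles, idle_ns) for one execution of a configuration context.

    The context is assumed to persist for the smallest whole number of
    master clock cycles that covers its critical path. Integer nanosecond
    values are used so that the ceiling division is exact.
    """
    cycles = -(-critical_path_ns // clock_period_ns)  # ceiling division
    idle_ns = cycles * clock_period_ns - critical_path_ns
    return cycles, idle_ns


if __name__ == "__main__":
    CLOCK_PERIOD_NS = 10  # hypothetical master clock period
    # Hypothetical critical paths: unpipelined, then two pipelined variants.
    for critical_path in (38, 13, 10):
        cycles, idle = context_timing(critical_path, CLOCK_PERIOD_NS)
        wasted = idle / (cycles * CLOCK_PERIOD_NS)
        print(f"{critical_path}ns path: {cycles} cycle(s), "
              f"{idle}ns idle ({wasted:.0%} of the context time)")
```

In this example the unpipelined 38ns path wastes 2ns out of 40ns, whereas a pipelined 13ns path wastes 7ns out of 20ns; choosing a constraint that lands on a clock boundary (10ns here) removes the idle time altogether.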
Section 2 reviews existing pipelining techniques, and relevant soft­
ware optimisation techniques. Section 3.1 describes an algorithm 
to perform pipeline stage allocation, and section 3.2 shows how 
properties of dynamic reconfiguration can be used to fill and flush 
the resulting pipeline. Section 3.3 details how the task can be com­
pletely automated. Section 4 shows the result of applying this tech­
nique to a real-life kernel used in image processing.
2. PREVIOUS WORK
For architectures that support instruction chaining, scheduling in­
volves mapping as many dependent and independent data paths into 
as few configuration contexts as possible [10]. Independent data
paths run in parallel, so the time for which a configuration persists 
is determined by the maximum critical path length of these data 
paths. If sufficient functional unit resources are available, loops 
can be optimised by loop unrolling [11]— i.e. placing multiple it­
erations as independent data paths in the same configuration. This 
allows multiple iterations to begin and end at once. This does not 
change the original critical path length, yet can increase the throu­
ghput. The throughput is determined by the critical path length of 
a loop iteration and the number of iterations that can be performed 
at once. During each execution of the loop configuration context, 
data propagates through the operation chains until the final result is 
ready. This means that the functional units involved in that chain 
are only performing useful work for a fraction of the time. This is 
where structural-level pipelining of these data paths comes in— to 
artificially reduce the critical path length by allowing new iterations 
to begin without waiting for the completion of previous iterations.
Various approaches of pipelining data paths have been proposed [12].
These require that the designer specifies a throughput constraint,
in order to allow the algorithm to best make the choice between
throughput and the area overhead each pipeline stage introduces.
These approaches describe various algorithms for the task
of pipeline stage allocation, applied to a number of different lev­
els in a design. On reconfigurable architectures such as FPGAs, 
custom pipelines can be rendered as part of the configuration, lead­
ing to significant increases in throughput [13]. The previous work 
on dynamically pipelining DRAs [14] proposed a technique where 
pipelining would be performed based on a critical path constraint 
provided by the application developer. The work here elaborates 
on this technique, and looks into more detail on the real-life perfor­
mance. Extensions are proposed to automate the choice of critical 
path constraint, and to maximise the real-life throughput.
3. DYNAMIC PIPELINING
Conventional structural-level pipelining can be applied to single 
configuration context kernels with long critical data paths, in or­
der to reduce the critical path, and thus increase throughput. This 
is done as part of the configuration— i.e. pipelines tailored to the 
particular kernel are rendered onto the core at runtime. This is done 
using existing register resources in the core to delay values for a sin­
gle execution cycle, allowing values to be bridged across pipeline 
stage boundaries.
Structural pipelining is applied to the kernel basic block by first assigning
each operation in the original data flow graph to a pipeline
stage. Then, registers are introduced to store values over bound­
aries between pipeline stages. Only those values that are used in 
later pipeline stages are stored. A new register is needed for each 
value for each pipeline stage boundary over which it must persist. 
Figure 1 shows an example kernel before and after structural-level 
pipelining. The example includes only simple feedback chains con­
sisting of a simple increment of the value of a register, however 
more complex feedback chains are also possible.
3.1 Pipeline stage allocation
First, constraints are defined between operations, where the order 
of execution is important. Examples include ‘same stage or earlier’ 
constraints between operations reading from input registers and op­
erations that have those same registers marked as global output reg­
isters, and ‘same stage or earlier’ constraints between data memory 
read operations and potentially aliasing data memory write oper­
ations. All operations in a feedback chain must be placed in the 
same pipeline stage, since such chains require single-step total la­
tency in order to keep the pipeline full. The algorithm for assigning 
pipeline stages to each operation is as follows:
•  Identify the ‘jump’ operation, and all of its dependencies. Save
this in a set (the ‘jump chain’ set).
• Create the ‘remaining’ set— a record of those operations yet to 
be assigned to a pipeline stage. This is initially populated with 
all the operations except for those in the ‘jump chain’ set.
•  Define the constraints:
-  Add ‘same stage or earlier’ constraints between operations 
reading from input registers, and operations that have those 
same registers marked as global output registers.
-  Add ‘same stage or earlier’ constraints between data memory 
read operations and potentially aliasing data memory write 
operations.
-  Add ‘same stage or earlier’ constraints between volatile operations
of the same kind, to ensure that they still appear in
their original order.
•  Detect feedback chains:
-  Identify all the operations that are part of each feedback chain, 
and record them in a set for each chain. These shall be re­
ferred to as the ‘feedback’ sets. No operation in a feedback 
set may be assigned to a pipeline stage until all the operations 
in that set are ready to be assigned.
•  Create an ordered list of pipeline stages, initially consisting of a 
single entry. Each entry contains the set of operations that have 
been assigned to that pipeline stage.
•  For each operation in the ‘remaining’ set:
-  Create a temporary set containing this operation and any op­
erations in the same ‘feedback’ set (if one exists).
-  Determine whether any of the operations in the temporary set 
have any successors that are also in the ‘remaining’ set. If 
they do, then the temporary set is not ready, so discard it and 
move on to the next operation in the ‘remaining’ set.
Figure 1: Example kernel data flow graph, (a) before pipelining, (b) after pipelining (kernel loop context). The inserted pipeline stage
registers are shown in red. The per-cycle critical path is shown in bold, and is shorter in (b), which allows for a higher throughput.
-  Determine whether any constraints involving the operations 
in the temporary set involve operations that are also in the ‘re­
maining’ set. If they do, then the temporary set is not ready, 
so discard it and move on to the next operation in the ‘remain­
ing’ set.
-  Identify the latest pipeline stage where all the operations in 
the temporary set could be placed, according to their depen­
dencies and constraints.
-  Construct a configuration context containing all the pipeline 
stages constructed thus far, and calculate its critical path delay 
(including the reading from and writing to pipeline registers).
-  Speculatively construct a configuration context containing all 
the pipeline stages constructed thus far, including the opera­
tions from the temporary set, placed in the previously identi­
fied pipeline stage. Calculate its critical path delay.
-  If the critical path delay is different (i.e. increased), and the 
new delay exceeds the target, then move to the preceding 
pipeline stage (creating a new pipeline stage at the beginning 
of the list, if the chosen stage was the first in the list).
-  Transfer the operations from the temporary set to the iden­
tified pipeline stage, and remove them from the ‘remaining’ 
set.
-  Loop whilst the ‘remaining’ set is not empty.
•  Add the operations from the ‘jump chain’ set to the first pipeline 
stage.
The algorithm is a form of list scheduling. Only operations whose 
predecessors (in the data path) have already been assigned a pipeline 
stage may be considered for insertion on each pass. In order to 
minimise the register count, operations should be placed in as late 
a pipeline stage as possible. Operations that must be placed in the 
same stage are dealt with together. Operations are considered for 
placement in the latest pipeline stage containing any of their prede­
cessors. Then, the insertion point is moved towards later pipeline 
stages until all constraints have been satisfied. Once a valid inser­
tion point has been identified, the critical path is calculated for the 
resulting (incomplete) configuration context with the operation in
that pipeline stage. If the critical path meets the target value, the 
operation is placed in that pipeline stage. Otherwise, the operation 
is added to the next pipeline stage (creating it if it does not exist).
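The core of the allocation can be sketched in Python as follows. This is a deliberately reduced illustration rather than the algorithm as implemented: it follows the bulleted steps in working backwards from the outputs and moving an operation to the preceding stage when the target is exceeded, but it omits the ‘feedback’ sets, the ‘same stage or earlier’ constraints, the ‘jump chain’ handling and the delay of the pipeline registers themselves. The function name, the argument layout and the example delays are all hypothetical.

```python
def allocate_stages(delay, successors, target):
    """Assign each operation to a pipeline stage (1 = first stage).

    delay:      {op: delay of that operation}
    successors: {op: set of ops that consume its result}
    target:     critical path constraint for a single stage
    """
    stage = {}  # op -> stage index (0 = last stage, negative = earlier)
    tail = {}   # op -> longest delay from op to the end of its stage
    remaining = set(delay)
    while remaining:
        # An operation is ready once none of its successors are unassigned.
        ready = sorted(op for op in remaining
                       if not (successors.get(op, set()) & remaining))
        if not ready:
            raise ValueError("data flow graph contains a cycle")
        for op in ready:
            succs = successors.get(op, set())
            # Latest stage permitted by the data dependencies.
            s = min((stage[x] for x in succs), default=0)
            # Critical path through this stage if op is chained onto it.
            chained = [tail[x] for x in succs if stage[x] == s]
            path = delay[op] + max(chained, default=0)
            if path > target:
                s -= 1            # move to the preceding stage
                path = delay[op]  # op now ends at a stage boundary
            stage[op], tail[op] = s, path
            remaining.discard(op)
    if not stage:
        return {}
    first = min(stage.values())
    return {op: s - first + 1 for op, s in stage.items()}


# Hypothetical kernel: two inputs feeding a chain of three operations.
DELAY = {"load": 3, "mul": 6, "add": 4, "shift": 2, "store": 5}
SUCCESSORS = {"load": {"add"}, "mul": {"add"},
              "add": {"shift"}, "shift": {"store"}}

if __name__ == "__main__":
    print(allocate_stages(DELAY, SUCCESSORS, target=10))
```

With a target of 10 the chain is split between ‘add’ and ‘shift’, giving two stages whose longest internal paths are 10 and 7; relaxing the target to 20 leaves the whole kernel in a single stage.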
The creation of dependencies ensures that the sequence of state
changes is maintained, thus ensuring correct results. Assigning operations
to as late a pipeline stage as possible helps to reduce the
number of registers required. Once the pipeline stages have been
determined, pipeline stage registers are assigned as follows:
•  For each pipeline stage in sequence:
-  Assign a new register storing the value produced by each op­
eration in all previous pipeline stages that needs to be stored 
for use in this or any later stage.
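The register assignment step can be illustrated with the following Python sketch, which allocates one register for every stage boundary that a value must cross. The function name and the data layout are assumptions made for the example, and physical register allocation (mapping these onto the register resources of the core) is not modelled.

```python
def assign_stage_registers(stage, successors):
    """Return a list of (boundary, producer) pipeline registers.

    stage:      {op: pipeline stage, starting at 1}
    successors: {op: set of ops that consume its result}
    A boundary b lies between stage b and stage b + 1. A value needs a
    register on every boundary between its producer and its last consumer.
    """
    registers = []
    last_stage = max(stage.values())
    for boundary in range(1, last_stage):
        for op in sorted(stage):
            consumers = successors.get(op, ())
            last_use = max((stage[c] for c in consumers), default=stage[op])
            if stage[op] <= boundary < last_use:
                registers.append((boundary, op))
    return registers


# Hypothetical 3-stage example: 'a' is used in stages 2 and 3.
STAGE = {"a": 1, "b": 2, "c": 3}
SUCCESSORS = {"a": {"b", "c"}, "b": {"c"}}

if __name__ == "__main__":
    for boundary, op in assign_stage_registers(STAGE, SUCCESSORS):
        print(f"register for '{op}' across boundary {boundary}/{boundary + 1}")
```

In the example the value of ‘a’ needs two registers because it persists over two boundaries, which is why placing operations as late as possible reduces the register count.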
3.2 Dynamic initialisation and clean-up
Normally, a pipelined design would require additional logic to take 
care of initialising the pipeline stages, or to suppress the opera­
tions in later pipeline stages until the previous stages have filled 
(predication), so that they do not operate on garbage. However, 
the pipelines in a coarse-grained DRA are themselves rendered as 
part of the configuration context. Provided that the configuration 
time is not significantly larger than the execution time of each step, 
dynamic reconfiguration can be used to render different configura­
tions before the main kernel loop configuration, to fill successive 
stages of the pipeline, and similarly to flush the pipeline after ex­
iting the kernel loop. This allows the kernel loop configuration to 
assume that the pipeline stages are always full. This provides a 
generic, purely software alternative to predication, which can be 
used as a fall-back when no hardware support exists.
Prologue: New configuration contexts are created to initially fill 
each successive stage of the pipeline. For n pipeline stages,
n - 1 pipeline filling contexts are created.
Loop: A single configuration context is created for the kernel 
loop, which includes all pipeline stages.
Epilogue: New configuration contexts are created to flush successive
stages of the pipeline. For n pipeline stages, n - 1
pipeline flushing contexts are created.
The core is dynamically reconfigured to first perform pipeline ini­
tialisation, then reconfigured to execute the kernel loop, then finally 
reconfigured to flush the pipeline— as demonstrated in figure 2. 








Figure 2: Control flow for a 3-stage pipelined kernel, showing which
stages are active in each context (and moment in time). Execution
flows from one context to the next, except in the kernel
loop, which loops back to itself (holding the same context) until
the end condition is satisfied.
Figure 3: Expanded control flow for the pipeline shown in figure 2 for
(a) 3, and (b) 6 iterations. The point at which the loop termination
condition should evaluate to true is shown by a dotted
box. It can be seen in both cases that only the first stage has
executed for the desired number of iterations by this point.
The configuration contexts generated for the kernel example from
figure 1 are shown in figure 4. The use of separate special-purpose
configurations alleviates the need for special logic for this purpose 
in the kernel loop configuration context, keeping its size down, and 
thus not compromising the potential parallelism in the core.
Figure 2 shows which stages of the pipeline are active during exe­
cution for a 3-stage pipeline. As the target architectures may not be 
state free (e.g. memory access), it is important to not allow any op­
eration in any pipeline stage to operate on garbage, and to preserve 
the execution count. With the arrangement shown in the figure, all 
pipeline stages will be executed the same number of times irrespec­
tive of the number of iterations performed in the kernel loop.
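The prologue, loop and epilogue arrangement, and the claim that every stage executes the same number of times, can be checked with a short Python sketch. The function names are hypothetical; each context is represented simply as the list of pipeline stages that are active in it.

```python
def pipeline_contexts(n_stages):
    """Return (prologue, loop, epilogue) as lists of active stage numbers."""
    stages = list(range(1, n_stages + 1))
    # Filling: stage 1, then stages 1-2, ... (n - 1 contexts in total).
    prologue = [stages[:k] for k in range(1, n_stages)]
    # Kernel loop: all stages active.
    loop = stages
    # Flushing: stages 2..n, then 3..n, ... (n - 1 contexts in total).
    epilogue = [stages[k:] for k in range(1, n_stages)]
    return prologue, loop, epilogue


def executions_per_stage(n_stages, loop_count):
    """Count how often each stage runs when the loop context runs loop_count times."""
    prologue, loop, epilogue = pipeline_contexts(n_stages)
    sequence = prologue + [loop] * loop_count + epilogue
    return {s: sum(s in context for context in sequence) for s in loop}


if __name__ == "__main__":
    print(pipeline_contexts(3))
    print(executions_per_stage(3, loop_count=1))
    print(executions_per_stage(3, loop_count=4))
```

For a 3-stage pipeline with the loop context executed once, every stage runs three times, which is the minimum iteration count of the pipelined kernel; each further pass through the loop context adds one execution to every stage.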
Figure 4: The sequence of configuration contexts created for the example
kernel: (a) iteration 1, filling pipeline stage 1; (b) iteration 2,
filling pipeline stages 1 and 2; (c) iterations 3 to n - 2,
pipeline full (loop); (d) iteration n - 1, flushing pipeline stage
1; (e) iteration n, flushing pipeline stage 2.
Now consider the original kernel, where the ‘jump’ operation causes
the loop to terminate after n iterations. In the pipelined kernel, we
must ensure that the kernel loop terminates after n executions of
the operations that calculate the loop termination condition;
otherwise, the operations or operands would need to be modified to
yield a different iteration count. Looking at figure 2, the minimum
number of iterations possible in the pipelined design occurs when the
kernel loop context executes only once. This corresponds to an
iteration count equal to the number of pipeline stages (in this case 3).
In order for the loop to terminate immediately, the operations that
determine the loop termination condition must have been executed
this number of times by the time the kernel loop context has been
executed. This can only be achieved by placing these operations
in the first pipeline stage. The same argument also applies for any
higher iteration count. Figure 3 shows two examples, to highlight
this point.
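The argument can be checked numerically with a small sketch. It assumes the context layout described earlier (stages - 1 filling contexts followed by the kernel loop context); the function name and the 1-based stage numbering are illustrative.

```python
from typing import Dict


def executions_when_loop_exits(stages: int, iterations: int) -> Dict[int, int]:
    """Count how often each stage has run when the kernel loop context finishes.

    `iterations` is the iteration count of the original (non-pipelined) loop.
    It must be at least `stages`, the minimum the pipelined form can perform.
    """
    if iterations < stages:
        raise ValueError("the pipelined kernel cannot run fewer iterations than it has stages")
    loop_runs = iterations - (stages - 1)  # executions of the kernel loop context
    # Stage j is active in the last (stages - j) filling contexts and in every
    # execution of the kernel loop context.
    return {j: (stages - j) + loop_runs for j in range(1, stages + 1)}


if __name__ == "__main__":
    print(executions_when_loop_exits(stages=3, iterations=3))
    print(executions_when_loop_exits(stages=3, iterations=6))
```

For the two cases of figure 3 (3 and 6 iterations of a 3-stage pipeline), only stage 1 has reached the full iteration count at the moment the loop must exit, so the termination logic has to live there.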
Placing the ‘jump’ in the first pipeline stage therefore requires that
all of its dependencies are also placed in the first pipeline stage.
Since the pipeline filling contexts (prologue) should always be
executed in sequence (with no branching), the ‘jump’ operation is
omitted from these contexts, even though it is in a pipeline stage
active in those contexts. Its dependencies are left in place, since
their side effects are important: for example, they could update the
iteration counter whose value is used to determine the loop
termination condition.
3.3 Automating the choice of timing constraint
Figure 5: Idle time resulting from the master clock. The shorter
the critical path of the kernel, the more effect this has. This
particularly affects pipelined kernels.
The arbitrary operation chaining supported by the target architec­
tures leads to a great variation in critical path length in different 
configuration contexts, as paths can be constructed involving long 
chains of a varying number of cells, and each type of cell has a dif­
ferent combinatorial delay. Ideally, each iteration of the configura­
tion context should be allowed to persist for the time required for 
the results to stabilise on the operation(s) that lie at the end of the 
critical path. In order to avoid the overhead of asynchronous logic, 
a master clock is normally used instead, and the iteration ends on 
the next master clock cycle after the last results have stabilised, as 
can be seen in figure 5. To minimise the resulting idle time between 
these two events, it is desirable to minimise the period of the master 
clock. However, high clock frequencies come at the cost of power 
consumption and area. Therefore, a suitable compromise has to be 
made.
Since pipelining reduces the critical path length of each iteration of
the kernel loop configuration context, the quantisation introduced
by the master clock frequency affects pipelined contexts more.
Therefore, it is important to minimise the wasted time between the
critical path stabilising and the next master clock cycle. This fact is
used to aid the automatic choice of the timing constraint.
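As a rough illustration of this quantisation, the sketch below rounds a critical path up to the next master clock edge and reports the resulting idle time. The function names and the small floating-point tolerance are assumptions made for the example.

```python
import math

_EPS = 1e-9  # tolerance so that exact multiples are not rounded up by float noise


def iteration_time(critical_path_ns: float, clock_period_ns: float) -> float:
    """Round the critical path up to the next whole master clock period."""
    cycles = math.ceil(critical_path_ns / clock_period_ns - _EPS)
    return cycles * clock_period_ns


def idle_time(critical_path_ns: float, clock_period_ns: float) -> float:
    """Time spent waiting for the clock edge after the results have stabilised."""
    return iteration_time(critical_path_ns, clock_period_ns) - critical_path_ns


if __name__ == "__main__":
    # A long (non-pipelined) path and a short (pipelined-stage) path on a 10 ns clock.
    for path_ns in (77.0, 10.95):
        total = iteration_time(path_ns, 10.0)
        print(path_ns, total, idle_time(path_ns, 10.0) / total)
```

On a 10 ns clock, a 77.0 ns path occupies 80 ns, while a 10.95 ns stage occupies 20 ns, so the short path loses a much larger share of each iteration to idle time.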
The timing constraint is initially chosen to be the minimum possi­
ble critical path length that a pipeline stage can consist of. This is 
determined by the length of certain data paths that cannot be split 
across pipeline stages. These include the jump condition logic de­
termining when to finish the loop, and feedback loops that update 
a register or memory location (where that register or memory lo­
cation is both read from and written to in the same kernel). The 
one with the longest critical path length is selected, and the value 
rounded up to the next integer multiple of the master clock period.
Then, pipeline stage allocation is performed using this critical path 
constraint. If a valid pipeline could be constructed, register allo­
cation is performed. If there are sufficient registers available, then 
this pipeline geometry is used, since it will result in the highest pos­
sible iteration rate. Otherwise, the timing constraint is incremented
by one master clock period, and the process continues. A natural 
end point exists where this value reaches the critical path of the 
non-pipelined kernel. If reached, the context is left non-pipelined.
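The search loop described above can be sketched as follows. This is a toy model under loud assumptions: the kernel is treated as a linear chain of indivisible delays (the real data path is a graph), stage allocation is a simple greedy packing, and register demand is a flat hypothetical cost per stage boundary. None of these names or cost models come from the actual tool flow.

```python
import math
from typing import List, Optional, Sequence, Tuple

_EPS = 1e-9  # floating-point tolerance


def round_up(value_ns: float, period_ns: float) -> float:
    """Round a delay up to the next integer multiple of the clock period."""
    return math.ceil(value_ns / period_ns - _EPS) * period_ns


def pack_stages(delays_ns: Sequence[float], constraint_ns: float) -> Optional[List[List[float]]]:
    """Greedily pack a chain of delays into stages no longer than the constraint."""
    stages: List[List[float]] = []
    current: List[float] = []
    used = 0.0
    for delay in delays_ns:
        if delay > constraint_ns + _EPS:
            return None  # an indivisible unit does not fit: no valid pipeline
        if current and used + delay > constraint_ns + _EPS:
            stages.append(current)
            current, used = [], 0.0
        current.append(delay)
        used += delay
    if current:
        stages.append(current)
    return stages


def choose_constraint(
    delays_ns: Sequence[float],
    min_constraint_ns: float,
    clock_period_ns: float,
    registers_available: int,
    registers_per_boundary: int,  # hypothetical flat register cost model
) -> Optional[Tuple[float, List[List[float]]]]:
    """Return (constraint, stages), or None if the kernel stays non-pipelined."""
    non_pipelined_ns = sum(delays_ns)
    constraint = round_up(min_constraint_ns, clock_period_ns)
    while constraint < non_pipelined_ns - _EPS:
        stages = pack_stages(delays_ns, constraint)
        if stages is not None and len(stages) > 1:
            needed = registers_per_boundary * (len(stages) - 1)
            if needed <= registers_available:
                return constraint, stages
        constraint += clock_period_ns  # relax by one master clock period
    return None
```

With delays of 4, 3, 5, 2, 6 and 4 ns, a 6 ns minimum constraint and a 5 ns clock, this sketch settles on a 10 ns constraint when 8 registers are available, relaxes to 15 ns when only 4 are available, and leaves the kernel non-pipelined when there are none.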
For completely automatic pipelining, feedback-directed optimisation
is used. The program is first executed in a simulator prior to
pipelining, and profiling information is fed back into the compiler.
Basic blocks that loop to themselves are identified, and where
sufficient resources exist in the core to map the entire block into one
configuration context, these are potential candidates for pipelining.
The number of consecutive iterations of each candidate is determined
through the profiling results. The minimum number of consecutive
iterations for a kernel defines the maximum depth to which it can be
pipelined: the pipeline depth must not exceed the minimum
execution count. This is used as a test during each iteration of the
timing constraint selection algorithm, where a potential pipeline is
checked for its depth not exceeding the minimum iteration count. If
it does, then the geometry is considered invalid, and the algorithm
continues with a larger timing constraint. To take into account the
cost of loading the new configurations from memory, the minimum
iteration count value is artificially reduced by an arbitrary count, to
weigh the algorithm in favour of only pipelining loops with
significant iteration counts.
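The depth test can be expressed as a one-line predicate. In this sketch, `load_margin` stands for the arbitrary count subtracted from the profiled minimum; the name and the example values are hypothetical.

```python
def depth_allowed(pipeline_depth: int, min_consecutive_iterations: int, load_margin: int) -> bool:
    """Accept a geometry only if its depth fits the derated minimum iteration count.

    `load_margin` models the arbitrary reduction that accounts for the cost of
    loading the extra configuration contexts from memory.
    """
    effective_minimum = max(min_consecutive_iterations - load_margin, 0)
    return pipeline_depth <= effective_minimum


if __name__ == "__main__":
    print(depth_allowed(9, 640, 32))  # long-running loop: a deep pipeline is acceptable
    print(depth_allowed(5, 6, 4))     # short loop: rejected, try a larger timing constraint
```

A rejected geometry sends the selection algorithm back to try a larger timing constraint, which in turn yields a shallower pipeline.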
4. APPLICATION TO STREAMING
The algorithm described in this paper was applied to two real- 
life applications: a 7-line Hamilton demosaic filter [16], and a 
multiplication-based iterative software division algorithm. The de­
mosaic involves interpolating missing colour components from the 
Bayer output of a colour filter array sensor. Division on a per-pixel 
level is used as part of many commercial noise reduction filters. 
Both are high-throughput tasks normally done on-chip as part of a 
custom image signal processing (ISP) pipeline, used in modern dig­
ital cameras and mobile phones. Both kernels were implemented on 
a reconfigurable instruction cell-based processor [5] (180nm timing 
figures), using the C language. Software optimisation techniques 
were used to reduce the main kernel in each case into a basic block 
small enough to fit onto the target architecture in a single configu­
ration context. Both example kernels produce a single output pixel 
per iteration.
The performance of the pipelining for both cases is shown in fig­
ure 6, and some additional details are given for the Hamilton de­
mosaic in table 1.
The main trend to notice is the ability for the maximum achievable 
iteration rate (after pipelining) to generally increase as the master 
clock frequency is increased. Since the same underlying data path 
is used in each case, the non-pipelined critical path length is con­
stant. The iteration time of the non-pipelined data paths is just the 
critical path length rounded up to the next integer multiple of the 
master clock period. As the master clock period is decreased, the 
algorithm is able to produce a pipeline with a critical path closer to 
the theoretical minimum (as dictated by the indivisible data paths 
such as feedback loops, and the jump condition chain). However, 
the number of pipeline stages required to do this increases in a
faster than linear fashion. This is due to quantisation: the error
between the time taken for each data path fragment in each pipeline
stage to complete and the closest integer multiple of the master
clock period. As the pipeline stages get shorter, the relative
size of the indivisible units being pipelined (i.e. the internal delays
of each cell and section of interconnect) increases compared to the
resolution of the master clock. The algorithm does well in minimising
this effect, and the percentage improvements with and without
the effect of the master clock are relatively close in all cases.

Table 1: Performance of the demosaic filter kernel before and after automatic pipelining, over a range of master clock periods.

| Master clock period (ns) | 20.0 | 15.0 | 10.0 | 5.0 | 3.0 | 2.0 | 1.0 |
|---|---|---|---|---|---|---|---|
| Pipeline stages | 5 | 7 | 5 | 7 | 9 | 9 | 11 |
| Pipeline stage registers | 80 | 123 | 80 | 123 | 153 | 153 | 189 |
| Min. possible constraint (ns) | 10.95 | 10.95 | 10.95 | 10.95 | 10.95 | 10.95 | 10.95 |
| Non-pipelined critical path (ns) | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 |
| Pipelined critical path (ns) | 19.8 | 14.65 | 19.8 | 14.65 | 11.55 | 11.55 | 11.00 |
| Improvement in critical path | 389% | 526% | 389% | 526% | 667% | 667% | 700% |
| Non-pipelined iteration time (ns) | 80.0 | 90.0 | 80.0 | 80.0 | 78.0 | 78.0 | 77.0 |
| Pipelined iteration time (ns) | 20.0 | 15.0 | 20.0 | 15.0 | 12.0 | 12.0 | 11.0 |
| Improvement in iteration time | 400% | 600% | 400% | 533% | 650% | 650% | 636% |
| Pipelined throughput (MPixels/s) | 50.0 | 66.6 | 50.0 | 66.6 | 83.3 | 83.3 | 90.9 |

Figure 6: Throughput before and after automatic pipelining, plotted against master clock period (ns), for two pixel-level code examples:
Hamilton demosaic and iterative software division. The theoretical line shows what could be achieved if the master clock were of
infinite frequency, based on the longest indivisible critical path (the iteration control logic in both of these cases).
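The throughput and improvement figures reported for the demosaic kernel follow from simple arithmetic on the critical path and iteration times. The sketch below shows that arithmetic, assuming one output pixel per iteration as stated for both kernels; the function names are illustrative.

```python
def throughput_mpixels_per_s(iteration_time_ns: float, pixels_per_iteration: int = 1) -> float:
    """Convert an iteration time in nanoseconds into MPixels/s."""
    # One iteration every t ns is 1e9 / t iterations per second, i.e. 1000 / t million.
    return pixels_per_iteration * 1000.0 / iteration_time_ns


def improvement_percent(before_ns: float, after_ns: float) -> float:
    """Express a reduction in time as a percentage ratio of before to after."""
    return 100.0 * before_ns / after_ns


if __name__ == "__main__":
    # Values for the 20 ns master clock period: 77.0 ns and 80.0 ns before
    # pipelining, 19.8 ns and 20.0 ns after.
    print(round(improvement_percent(77.0, 19.8)))  # critical path
    print(improvement_percent(80.0, 20.0))         # iteration time
    print(throughput_mpixels_per_s(20.0))          # pipelined throughput
```

For the 20 ns clock this gives a critical path improvement of about 389%, an iteration time improvement of 400%, and a pipelined throughput of 50 MPixels/s.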
The pipeline geometries constructed for each master clock frequency
setting are shown in figure 7. Both examples show identical
post-pipelining throughput (iteration rate), as both cases have the same
longest indivisible critical path, corresponding to the iteration
control (jump) logic (shown by the theoretical line in figure 6). There
are no data dependencies or other constraints limiting the potential
for pipelining in either example. If data dependencies, feedback
loops, or other constraints were present, these would be reflected by
a larger indivisible critical path. The shorter the indivisible critical
path, the more important the behaviour of the automatic pipelining
algorithm.
The resource-saving effect of the algorithm can be seen to come
into effect each time the current integer multiple of the master clock
period drops below the indivisible critical path length. This
makes the iteration rate curve appear to wrap around each time it
tries to cross the theoretical maximum iteration rate line. By
extending the length of the pipeline stages up to the next master clock
period, the number of registers is minimised, which avoids needless
congestion on the interconnect. The reduction in the number
of pipeline stages reduces the configuration size and the latency,
since fewer filling and flushing iterations need to be performed.
5. CONCLUSIONS
This work proposed an algorithm for automatically applying dynamic
structural-level pipelining to single configuration context kernels
running on dynamically reconfigurable arrays (DRAs). The
technique is a form of feedback-directed optimisation, where
profiling information (consecutive execution counts) is used to
determine which kernels will benefit from pipelining. Candidates with
very low consecutive execution counts must not be pipelined too
deeply. This is to ensure that the additional latency of pipeline
filling and flushing is more than offset by the decrease in total
execution time for the pipelined kernel loop when the pipeline is
full. This is only possible when the minimum possible iteration
count is known, as is the case for pixel-level kernels in the ISP
application domain, where the iteration count is typically the line
size of the image.
An iterative approach is used to form an efficient pipeline, where
the timing constraint is automatically chosen to be an integer
multiple of the master clock period. The timing constraint is
incremented until a valid pipeline can be constructed without
encountering register starvation. The range of possible pipeline
geometries is controlled by the availability of registers. Architectures
with distributed registers will offer the best results; otherwise the
bandwidth of the interface and/or additional combinatorial delays
introduced by routing to and from a register file would likely
outweigh any benefit. This makes the case for registers to be made
available in the interconnect itself.

Figure 7: Pipeline geometry from automatic pipelining, over a range of master clock periods, for two pixel-level code examples: Hamilton
demosaic and iterative software division.
The algorithm was applied to a demosaic kernel of modest complexity
and to a software division algorithm, both of which could be
pipelined to a significant depth. A performance increase
of up to 7 times can be obtained for the demosaic example, and
nearly 10 times for the division. As the pipeline gets deeper, the
cost, in terms of register requirement and storage for pipeline
filling and flushing contexts, increases more than linearly. As the
critical path of the pipelined kernel gets smaller, the quantisation
of the iteration rate caused by the master clock gets increasingly
worse. Inside the bounds of this quantisation, reducing the pipeline
critical path (by increasing the number of pipeline stages) has no
effect on the iteration rate. In these situations, extra resources would
be introduced for no benefit. To avoid this, the proposed algorithm
relaxes the critical path to take into account this quantisation,
thus minimising the resource requirements for a given physically
achievable iteration rate.
6. REFERENCES
[1] M. Butts, A. M. Jones, and P. Wasson, “A structural object
programming model, architecture, chip and tools for
reconfigurable computing,” in FCCM, 2007, pp. 55-64.
[2] B. Khailany, T. Williams, J. Lin, E. Long, M. Rygh,
D. Tovey, and W. Daly, “A programmable 512 GOPS stream
processor for signal, image, and video processing,” in
Solid-State Circuits Conference, 2007, pp. 272-602.
[3] A. Major, T. Arslan, et al., “H.264 decoder implementation
on a dynamically reconfigurable instruction cell based
architecture,” in International SOC Conference, 2006, pp.
49-52.
[4] Z. Khan, T. Arslan, et al., “Implementation of a real time
programmable encoder for low density parity check code on
a reconfigurable instruction cell architecture,” in Design
Automation Conference, Asia and South Pacific, 2007, pp.
583-588.
[5] S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir, and
T. Arslan, “The reconfigurable instruction cell array,” IEEE
Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 16, no. 1, pp. 1-11, 2008.
[6] “Loosely-biased heterogeneous reconfigurable arrays,” U.S.
Patent 20 050 257 024, 2005.
[7] M. Weinhardt and W. Luk, “Pipeline vectorization,” IEEE
Trans. Comp.-Aid. Des. Integ. Circ. and Syst., vol. 20, no. 2,
pp. 234-248, 2001.
[8] J. Liao, W. Wong, and T. Mitra, “A model for hardware
realization of kernel loops,” in Field Programmable Logic,
International Conference on, 2003, pp. 334-344.
[9] R. Rodrigues and J. Cardoso, “Pipelining sequences of 
loops—a first example,” in ARC, Workshop, 2005, pp. 
147-151.
[10] Y. Yi and I. Nousias, “System-level scheduling on instruction 
cell based reconfigurable systems,” in Design Automation 
and Test in Europe, International Conference on, 2006, pp. 
381-386.
[11] J. Sanchez and A. Gonzalez, “The effectiveness of loop
unrolling for modulo scheduling in clustered VLIW
architectures,” in ICCP Parallel Processing, International
Conference on, 2000, p. 555.
[12] S. Bakshi and D. Gajski, “Partitioning and pipelining for 
performance-constrained hardware/software systems,” Very 
Large Scale Integration (VLSI) Systems, IEEE Transactions 
on, vol. 7, no. 4, pp. 419-432, 1999.
[13] S. Silva and S. Bampi, “Area and throughput trade-offs in the 
design of pipelined discrete wavelet transform architectures,” 
in Design Automation and Test in Europe, International 
Conference on, 2005, pp. 32-37.
[14] M. Muir, T. Arslan, and I. Lindsay, “Automated dynamic 
throughput-constrained structural-level pipelining in 
streaming applications,” in Design Automation and Test in 
Europe, international conference on, 2008, pp. 1358-1361.
[15] M. Lam, “Software pipelining: an effective scheduling 
technique for VLIW machines,” in ACM SIGPLAN 
conference on Programming Language design and 
Implementation. New York, NY, USA: ACM Press, 1988, 
pp. 318-328.
[16] R. Ramanath, W. Snyder, and G. Bilbro, “Demosaicking 
methods for bayer color arrays,” Electronic Imaging, vol. 11, 
no. 3, pp. 306-315, 2002.
References
[1] M. Butts, A. M. Jones, and P. W asson, “A structural object program m ing m odel, architecture, ch ip  and tools 
for reconfigurable com puting,” in FCCM , 2007, pp. 55 -64 .
[2] B. Khailany, T. W illiam s, J. Lin, E. Long, M. Rygh, D. Tovey, and W. Daly, "A program m able 512 G O PS 
stream  processor for signal, im age, and video processing,” in Solid-Sta te C ircuits C onference , 2007, pp. 
272-602 .
[3] A. M ajor, T. A rlsan, et al., “ H .264 decoder im plem entation on a dynam ically  reconfigurable instruction cell 
based architecture,” in In ternational SO C  Conference, 2006, pp. 4 9 -5 2 .
[4] Z. Khan, T. Arlsan, et al., “ Im plem entation o f  a real tim e program m able encoder for low density  parity 
check code on a reconfigurable instruction cell architecture,” in D esign A utom ation Conference, A sia  and  
South Pacific, 2007, pp. 583-588 .
[5] S. K haw am , I. N ousias, M. M ilw ard, Y. Yi, M. Muir, and T. A rslan, "T he reconfigurable instruction cell 
array,” IEE E  Transactions on Very Large Scale Integration (VLSI) System s, vol. 16, no. 1, pp. 1-11, 2008.
[6j “L oosely-biased heterogeneous reconfigurable arrays,” U .S. Patent 2 0 0 5 0  257 024, 2005.
[7] M. W einhardt and W. Luk, “Pipeline vectorization,” IE E E  Trans. Comi).-Aid. Des. Inleg. Circ. a n d  Syst., 
vol. 20, no. 2, pp. 234 -2 4 8 , 2001.
[8] J. L iao, W. W ong, and T. M itra, “A m odel for hardw are realization o f  kernel loops,” in Field Program m able  
Logic, In ternational C onference on, 2003, pp. 334 -344 .
[9] R. R odrigues and J. C ardoso, “ P ipelining sequences o f loops— a first exam ple,” in ARC, W orkshop, 2005, 
pp. 147-151.
[10] M . Muir, T. Arslan, and I. Lindsay, “A utom ated dynam ic th roughput-constrained structural-level pipelining 
in stream ing applications,” in Design A utom ation  a n d  Test in Europe, in ternational conference on, 2008, pp. 
1358-1361.
[11] M. M uir, N. Aslam , I. N ousias, A. M ajor, T. A rslan, and I. Lindsay, “A utom atic dynam ic structural-level 
pipelin ing  in reconfigurable processors,” in D esign a n d  Architectures f o r  S ignal a n d  Im age Processing, co n ­
feren ce  on, 2008, pp. 222-228 .
[12] M. M uir, I. Lindsay, T. A rslan, I. N ousias, S. K haw am , M. M ilw ard, N. A slam , and A. M ajor, “E xtensib le 
softw are em ulator for reconfigurable instruction cell based processors,” in SO C  conference, IE E E  in terna­
tional conference on, 2008, pp. 35 -40 .
[13] A. D eH on and J. W aw rzynek, “R econfigurable com puting: what, why, and im plications for design au tom a­
tion,” in D AC  ’99: Proceedings o f  the 36th A C M /IE E E  conference on D esign autom ation. New York, NY, 
USA: ACM , 1999, pp. 610-615 .
[14] J. Henkel, “C losing the SoC design gap,” IEE E  com puter, 2003.
[15] K.-C. Wu and Y.-W. Tsai, “Structured A SIC, evolution or revolution?” in ISPD  ’04: Proceedings o f  the 2004  
international sym posium  on P hysical design. N ew  York, NY, USA: ACM , 2004, pp. 103-106.
[16] D. Lewis, E. A hm ed, G. B aeckler, V. Betz, B ourgeault, el ah, “The Stratix II logic and routing architecture,” 
in FPGA '05: Proceedings o f  the 2005 ACM /SIG D A 13tli international sym posium  on Field-program m able  
gate arrays. New York, NY, USA: ACM , 2005, pp. 14-20.
[17] L. Shang, A. S. Kaviani, and K. B athala, “D ynam ic pow er consum ption  in V irtex-II FPG A  family,” in FPGA  
’02: Proceedings o f  the 2002 ACM /SIG D A tenth international sym posium  on F ield-program m able gate ar­
rays. New York, NY, USA: ACM , 2002, pp. 157-164.
[18] A. G ayasen, N. V ijaykrishnan, and M. Irwin, “Exploring  technology alternatives for nano-scale FPG A  inter­
connects,” June 2005, pp. 921-926 .
[19] "1 G H z field program m able object array overview,” M athStar, 2007.
[20] P. C hiang and S. Riley, “Using a field program m able object array (FPOA ) to accelerate im age processing,” 
in Real-Tim e im age processing, vol. 6063, no. 1. SPIE, 2006, p. 60630E.
235
R eferences
[21] K. U nderw ood and K. Hem m ert, “Closing the gap: CPU and FPGA trends in sustainable floating-point 
BLAS perform ance,” April 2004, pp. 219-228.
[22] M. H erbordt, T. VanCourt, Y. Gu, and B. Sukhwani, “Achieving high perform ance with FPG A -based com ­
puting,” Computer: IEEE com putiing society, 2007.
[23] J. Rice, K. Pace, M. Gales, G. M orris, and K. Abed, “Reconfigurable com puter application design consider­
ations,” April 2008, pp. 236-243.
[24] P. M artin, M. Sm ith, S. Alam, and P. Agarwal, “Im plem entation m ethodology for em erging reconfigurable 
systems,” Aug. 2008, pp. 169-172.
[25] P. Claydon, “M ulticore future is right now,” in EE Thnes-Asia , 2007.
[26] J. Sm ith and G. Sohi, “The m icroarchitecture o f superscalar processors,” Proceedings o f  the IEEE , 1995.
[27] J. Stark, M. Evers, and Y. N. Patt, “Variable length path branch prediction,” SIG PLAN Not., vol. 33, no. 11, 
pp. 170-179, 1998.
[28] J. Farrell and T. Fischer, “ Issue logic for a 600-M H z out-of-order execution m icroprocessor,” Solid-State  
Circuits, IEEE  Journal of, vol. 33, no. 5, pp. 707-712 , May 1998.
[29] M. Lipasti and J. Shen, “Superspeculative m icroarchitecture for beyond AD 2000,” Computer, vol. 30, no. 9, 
pp. 5 9 -66 , Sep 1997.
[30] P. Hoare, A. Jones, D. Kusic, et al., "Rapid VLIW  processor custom ization for signal processing applications 
using com binational hardware functions,” EU RASIP Journal on A pplied  Signal Processing, vol. 2006, 2006.
[31] T. H alfhill, “Silicon Hive breaks out,” M icroprocessor Report, no. 169, D ecem ber 2003.
[32] M. Lam, “Software pipelining: an effective scheduling technique for VLIW  m achines,” in AC M  SIG PLAN  
conference on Program ming Language design and Implementation. New York, NY, USA: ACM Press, 
1988, pp. 318-328.
[33] D. W. Wall, “L im its o f instruction-level parallelism ,” in The Fourth International Conference on Architectural 
Support f o r  Program ming Languages and  Operating Systems, 1991, pp. 176-188.
[34] W. A. W olf and S. A. M cKee, “Hitting the memory wall: im plications o f  the obvious,” in SIG RAPH  Comput. 
Archil. News, vol. 23, 1995, pp. 20-24.
[35] E. K ilgariff and R. Fernando, “T he GeForce 6 series GPU architecture,” in SIG G RAPH  '05: AC M  SIG- 
G RAPH  2005 Courses. New York, NY, USA: ACM, 2005, p. 29.
[36] J. Bolz, I. Farmer, E. Grinspun, and P. Schrooder, “Sparse matrix solvers on the GPU: conjugate gradients 
and m ultigrid,” in SIG G RAPH  '03: ACM  SIG G RAPH  2003 Papers. New York, NY, USA: ACM, 2003, pp. 
917-924.
[37] J. Fung and S. Mann, "U sing m ultiple graphics cards as a general purpose parallel com puter: applications to 
com puter vision,” vol. 1, Aug. 2004, pp. 805-808.
[38] A. Duller, D. Towner, and G. Panesar, “picoA rray technology: the tool’s story,” in Design, Autom ation and  
Test in Europe, vol. 3, 2005, pp. 1530-1591.
[39] "C oupling integrated circuits in a parallel processing environm ent,” U.S. Patent 7 539 845, 2006.
[401 X. Jia and R. Vemuri, “Using GALS architecture to reduce the im pact o f  long wire delay on FPGA perfor­
mance,’ in ASP-D AC ’05: Proceedings o f  the 2005 conference on A sia  South Pacific design  autom ation. 
New York, NY, USA: ACM, 2005, pp. 1260-1263.
[41] "X lcnsa processor,” Tensilica Inc. [Online], Available: http://w ww.tensilica.com
[42] “ARCtangent processor,” ARC Intl. [Online]. Available: http://www.arc.com
[43] “Stretch processor,” Stretch Inc. [Online], Available: http://w w.stretchinc.com
[44] B. Mei, S. Vctnaldc, D. Vcrkcsl, H. De Man, and R. Lauwcrcins, “ADRES: An architecture with lightly 
coupled VLIW  processor and coatse-grained reconfigurable matrix,” Field-Program m able Logic a n d  A pp li­
cations, pp. 61-70 , 2003.
[45[ L. Bauer, M. Shaliquc. and J. Henkel, "R un-tim e instruction set selection in a transm utable em bedded pro­
cessor,” in DAC '08: Proceedings o f  the 45th annual conference on Design autom ation  New York NY 
USA: ACM , 2008, pp. 56-61.
236
R eferences
[46] R. H artenstein, “A decade o f  reconfigurable com puting: a  visionary retrospective,” in D A TE  '01 : Proceedings  
o f  the conference on Design, autom ation a n d  test in Europe. Piscataway, NJ, USA : IEEE Press, 2001, pp. 
642 -6 4 9 .
[47] P. H eysters, G. Sm it, and E. M olenkam p, “A flexible and energy-efficient coarse-grained reconfigurable 
architecture for m obile system s,” The Journal o f  Supercom puting , vol. 26, no. 3, pp. 2 8 3 -3 0 8 , 2003.
[48] H. Schm itt, D. W helihan, et al., “PipeR ench: A virtualised program m able datapath in 0.18 m icron  technol­
ogy,” in C ustom  Integrated Circuits Conference, 2002, pp. 6 3 -6 6 .
[49] S. K haw am , “Reconfigurable architectures for low -pow er SoC: D om ain- specific and rica based system s,” 
Ph.D . dissertation, U niversity o f  Edinburgh, School o f  Engineering, apr 2006.
[50] P. Bellow s and B. H utchings, “JH D L  - an H D L for reconfigurable system s,” in F P G A sfo r  Custom C om puting  
M achines, IEE E  sym posium  on, 1998, pp. 175-184.
[51] D. Lau, O. Pritchard, and P. M olson, “A utom ated generation o f  hardw are accelerators w ith direct m em ory 
access from  A N SI/ISO  standard C functions,” April 2006, pp. 4 5 -5 6 .
[52] A. Takach, B. Bowyer, and T. Bollaert, “C based hardw are design for w ireless applications,” in D ATE '05: 
Proceedings o f  the conference on Design, A utom ation a n d  Test in Europe. W ashington, DC, USA: IEEE 
C om puter Society, 2005, pp. 124—129.
[53] K. H am m ond and G. M ichaelson, “Bounded space p rogram m ing using finite sta te m achines and recursive 
functions: the hum e approach,” in A C M  Transactions on Softw are Engineering a n d  M ethodology (TO SEM ), 
2006.
[54] A. Hoffm an, T. Kogel, and H. Meyr, “A fram ew ork for fast hardw are-softw are co-sim ulation,” in D esign  
A utom ation  a n d  Test in Europe, international conference on, 2001, pp. 760-764 .
[55] A. H alam bi, P. G run, et a l., “EX PR ESSIO N : A language for architecture exploration through com ­
piler/sim ulator retargetability,” in D esign A utom ation a n d  Test in Europe, international conference on, 1999, 
pp. 485-490 .
[56] E. R. A ltm an, S. Sathaye, and M . G schw ind, “E xecution-based scheduling  for V LIW  architectures,” in Euro- 
P a r'99— Parallel Processing, international conference on, 1999, pp. 1269-1275.
[57] Y. Yi and I. N ousias, “System -level scheduling on instruction cell based reconfigurable system s,” in D esign  
A utom ation  and  Test in Europe, international conference on, 2006, pp. 381-386 .
[58] D. N ovillo, “TreeSSA  - a new  high-level optim isation  fram ew ork for the GN U  com piler collection,” 2003.
[59] ------- , “Design and im plem entation o f T reeSSA ,” 2004.
[60] C. Lattner and V. Adve, “T he LLVM com piler fram ew ork and infrastructure tutorial,” in M ini W orkshop on 
C om piler Research Infrastructures (LC P C '04 ), 2004.
[61] I. N ousias, “Reconfigurable com puting: T he reconfigurable instruction cell array: Reconfiguration and in ter­
connects,” Ph.D. dissertation, U niversity o f  Edinburgh, School o f  Engineering, apr 2009.
[62] L. Robertson, “A necdotes,” A nnals o f  the H istory o f  Com puting, IEEE, vol. 27, no. 2, pp. 8 2 -8 4 , 2005.
[63] H. D iab and I. D em ashkich, “A reconfigurable m icroprocessor teaching tool,” in Science, M easurem ent and  
Technology, IEE  Proceedings, 1990, pp. 287-292 .
[64] C. C ooper and P. W erstein, “T he use o f  Java to develop a m icroprocessor em ulator,” in Softw are Engineering: 
E ducation and  Practice, 1998, pp. 272-277 .
[65] W. Z aatar and G. E. Nasr, “An im plem entation schem e for a m icroprocessor em ulator,” in IC ECS Electronics, 
Circuits and  System s, 7th international conference on, 2000, pp. 169-172.
[66] S. Bush, "A R M  offers real-tim e prototyping capability,” Electronics Weekly, no. 39339, July 2006.
[67] R. C ohn and P. G. Lowncy, “Feedback directed optim isation in Compaq’s com pilation tools for A lpha,” in 
Proceedings o f  the 2nd  A C M  W orkshop on Feedback-directed optim isation , 1999.
[68] M. G schw ind, V. Salapura, and D. M aurer, “FPGA prototyping o f  a R ISC  processor core fo r em bedded 
applications,” IEE E  Transactions on VLSI System s, pp. 241-2 5 0 , 2001.
[69] Y. N akam ura and K. H osokaw a, “Fast FPGA em ulation based sim ulation environm ent for custom  proces­
sors,” IE IC E  transactions on fun d a m en ta ls  o f  electronic com m unications in com puter science, vol. E89-A , 
pp. 3464-3470 , 2006.
237
R eferences
[701 S. Fink and E. Sanchez, “Developm ent and prototyping for an 8-bit m ultitask m icropow er processor,” in 
Proceedings o f  the 6tli IEEE  International Workshop on Rapid System Prototyping , 1995, pp. 7 5 -78 .
[71] N. D. Jones, C. K. G om ard, and P. Sestoft, Partial Evaluation and Autom atic Program Generation. Prentice 
Hall International, 1993.
[72] D. A braham s and A. Gurtovoy, C + +  Template M etaprogramming: Concepts, Tools, and Techniques from  
Boost and B eyond (C++ in Depth Series). Addison-W esley Professional, 2004.
[73] The G NU com piler collection (open source). The free softw are foundation. [Online], Available: 
http://gcc.gnu.org
[74] H. N ilsson, “Porting GCC for dunces,” Axis C om m unication, 2000. [Online], Available: http: 
/ /ftp.axis.com /pub/users/hp/pgccfd/pgccfd-0.5.pdf
[75] M. Ganguin, M. Schinz, P. Mudry, and A. Ijspeert, “GCC back-end for the U lysse processor,” 2007. 
[Online], Available: http://birg.epfl.ch/w ebdav/site/birg/users/146738/public/m ovegcc.pdf
[76] L. V. Put, D. Chanet, B. D. Bus, B. D. Sutler, and K. D. Bosschere, “DIABLO: a reliable, retargetable and 
extensible link-tim e rew riting framework,” in Proceedings o f  the 5tli International Sym posium  on Signal 
Processing and Inform ation Technology, 2005, pp. 7-12.
[77] G. G allo, G. Longo, S. Palloltino, and S. Nguyen, “Directed hypergraphs and applications,” D iscrete A pplied  
M athem atics, vol. 42. no. 2-3, pp. 177-201, 1993.
[78] J. A. D eR osa and H. M. Levy, “An evaluation o f branch architectures,” in 1SCA ’87: Proceedings o f  the 14th 
annual international sym posium  on Com puter architecture. New York, NY, USA: ACM, 1987, pp. 10-16.
[79] V. Bala and N. Rubin, “Efficient instruction scheduling using finite state autom ata,” International Journal o f  
Parallel Program ming, vol. 25, no. 2, 1997.
[80] P. Duhamel and C. Guillem ot, “Polynomial transform  com putation o f the 2-D  DCT,” in Acoustics, Speech 
and Signal Processing (ICASSP), international conference on, vol. 3, 1990, pp. 1515-1518.
[81] W. Kao, S. Wang, L. Chen, and S. Lin, “Design considerations o f color im age processing pipeline for digital 
cam eras,” IEEE Transactions on Consum er Electronics, vol. 52, no. 4, pp. 1144-1152, nov 2006.
[82] A. Chihoub, Y. Bai, and V. Ramesh, “An imaging library for a TriCore based digital camera,” in Proceedings 
of the 5th IEEE international workshop on computer architectures for machine perception (CAMP), 2000, 
pp. 3-11.
[83] D. Coffin, dcraw project (open source). [Online], Available: http://www.cybercom.net/~dcoffin/dcraw/
[84] G. C. Fox, “What have we learnt from using real parallel machines to solve real problems?” in Proceedings 
of the third conference on Hypercube concurrent computers and applications. New York, NY, USA: ACM, 
1988, pp. 897-955.
[85] S. Note, F. Catthoor, et al., “Combined hardware selection and pipelining in high-performance data-path 
design,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 11, no. 4, 
pp. 413-423, 1992.
[86] S. Bakshi and D. Gajski, “Partitioning and pipelining for performance-constrained hardware/software 
systems,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 7, no. 4, pp. 419-432, 1999.
[87] S. Bakshi and D. D. Gajski, “Performance-constrained hierarchical pipelining for behaviors, loops, and 
operations,” ACM Trans. Des. Autom. Electron. Syst., vol. 6, no. 1, pp. 1-25, 2001.
[88] C. Leiserson and J. Saxe, “Retiming synchronous circuitry,” Algorithmica, vol. 6, no. 1, pp. 5-35, 1991.
[89] S. Bakshi, D. Gajski, and H. Juan, “Component selection in resource shared and pipelined DSP 
applications,” in EURO-DAC/EURO-VHDL: Proceedings of the conference on European design automation. Los 
Alamitos, CA, USA: IEEE Computer Society Press, 1996, pp. 370-375.
[90] S. Silva and S. Bampi, “Area and throughput trade-offs in the design of pipelined discrete wavelet transform 
architectures,” in Design Automation and Test in Europe, International Conference on, 2005, pp. 32-37.
[91] C. Soviani, I. Hadzic, and S. Edwards, “Synthesis of high-performance packet processing pipelines,” in 
Design Automation and Test in Europe, International Conference on, 2006, pp. 679-682.
[92] J. Sanchez and A. Gonzalez, “The effectiveness of loop unrolling for modulo scheduling in clustered VLIW 
architectures,” in ICPP Parallel Processing, International Conference on, 2000, p. 555.
[93] J. H. Patel and E. S. Davidson, “Improving the throughput of a pipeline by insertion of delays,” SIGARCH 
Comput. Archit. News, vol. 4, no. 4, pp. 159-164, 1976.
[94] B. R. Rau and J. Fisher, “Instruction-level parallel processing: History, overview, and perspective,” The 
Journal of Supercomputing, vol. 7, no. 1, pp. 9-50, 1993.
[95] F. Warg and P. Stenstrom, “Limits on speculative module-level parallelism in imperative and object-oriented 
programs on CMP platforms,” in Parallel Architectures and Compilation Techniques, Conference on, 2001, 
pp. 221-230.
[96] “Execution unit chaining for single cycle extract instruction having one serial shift left and one serial shift 
right execution units,” U.S. Patent 6 061 780, 2000.
[97] J. Mukherjee, M. Moore, and S. Mitra, “Color demosaicing with constrained buffering,” in Signal Processing 
and its Applications, Sixth International Symposium on, vol. 1, 2001, pp. 52-55.
[98] R. Ramanath, W. Snyder, and G. Bilbro, “Demosaicking methods for Bayer color arrays,” Electronic Imaging, 
vol. 11, no. 3, pp. 306-315, 2002.
[99] F. Brandner, A. Fellnhofer, A. Krall, and D. Riegler, “Fast and accurate simulation using the LLVM 
compiler framework,” in Proceedings of the Rapid Simulation and Performance Evaluation: Methods and Tools, 
RAPIDO’09, 2009.
[100] J. eun Lee, K. Choi, and N. Dutt, “Compilation approach for coarse-grained reconfigurable architectures,” 
Design and Test of Computers, IEEE, vol. 20, no. 1, pp. 26-33, Jan-Feb 2003.
[101] J. Merrill, “GENERIC and GIMPLE: A new tree representation for entire functions,” 2003.
[102] S. Callanan, D. Dean, and E. Zadok, “Extending GCC with modular GIMPLE optimisations,” 2003.
[103] M. Sanchez-Elez, M. Fernandez, R. Maestre, F. Kurdahi, R. Hermida, and N. Bagherzadeh, “A complete 
data scheduler for multi-context reconfigurable architectures,” in DATE ’02: Proceedings of the conference 
on Design, automation and test in Europe. Washington, DC, USA: IEEE Computer Society, 2002, p. 547.
[104] S. Lee, D. Raila, and V. Kindratenko, “LLVM-CHiMPS - compilation environment for FPGAs using LLVM 
compiler infrastructure and CHiMPS,” in Proceedings of the 4th annual reconfigurable systems summer 
institute RSSI’08, 2008.
