Loop Transformations for the Optimized Generation of Reconfigurable Hardware by Devos, Harald
Lustransformaties voor de geoptimaliseerde generatie
van herconfigureerbare hardware
Loop Transformations for the Optimized Generation 
of Reconfigurable Hardware
Harald Devos
Promotoren: prof. dr. ir. D. Stroobandt, prof. dr. ir. J. Van Campenhout
Proefschrift ingediend tot het behalen van de graad van 
Doctor in de Ingenieurswetenschappen: Elektrotechniek
Vakgroep Elektronica en Informatiesystemen
Voorzitter: prof. dr. ir. J. Van Campenhout
Faculteit Ingenieurswetenschappen
Academiejaar 2007 - 2008  
ISBN 978-90-8578-191-2
NUR 959 
Wettelijk depot: D/2008/10.500/10
Promotoren
prof. dr. ir. D. Stroobandt
prof. dr. ir. J. Van Campenhout
Universiteit Gent
Faculteit Ingenieurswetenschappen
Vakgroep Elektronica en Informatiesystemen (ELIS)
Onderzoeksgroep Parallelle Informatiesystemen (PARIS)
Sint-Pietersnieuwstraat 41
B-9000 Gent
Belgie¨
De auteur genoot tijdens zijn onderzoeksactiviteiten van een beurs als
FWO-aspirant.
Erkenntnis is nicht aus Lehren zu vermitteln,
sondern kann nur durch eigene Erfahrung erworben werden.
Hermann Hesse
Dankwoord
Acknowledgements
Dit werk zou niet mogelijk geweest zijn zonder de de financie¨le on-
dersteuning van het FWO (Fonds voor Wetenschappelijk Onderzoek –
Vlaanderen) en het IWT (Instituut voor de aanmoediging van Innovatie
door Wetenschap en Technologie in Vlaanderen).
Verder wil ik mijn promotoren bedanken omdat ze me de kans ga-
ven aan deze universiteit onderzoek te verrichten, me hierbij de vrij-
heid gaven om mijn eigen weg in de onderzoekswereld te zoeken en
richting gaven wanneer ik dreigde verloren te lopen.
Voor de dagelijkse werking van een onderzoeksgroep zijn postdocs
minstens zo belangrijk als professoren. Mark, bedankt voor de voor-
treffelijke leiding van het RESUME-project, het van tijd tot tijd mee-
brengen van ovenexperimenten en voor het feit dat we altijd je bu-
reau konden binnenwandelen met de meest uiteenlopende problemen
of ideee¨n. Kristof Beyls introduceerdeme in de wereld van lustransfor-
maties en het polyedrischemodel. Hij namme mee naar de doctoraats-
verdedigingen van Sven Verdoolaege en Sylvain Girbal — de eerste na
een onvergetelijke treinreis — die ik nog steeds als mijlpalen in mijn
onderzoek beschouw.
Sven Verdoolaege wil ik bedanken voor de vruchtbare discussies
omtrent het benaderen van quasi-veeltermen en het implementeren van
een deel van de technieken beschreven in hoofdstuk 4.
Many thanks go to Sylvain Girbal, Albert Cohen and Nicolas Vasi-
lache for providing the WRaP-IT/URUK tool suite and giving support.
I also would like to thank Ce´dric Bastoul for writing CLooG. I hope we
will meet soon.
Dank aan alle collega’s en oud-collega’s van PARIS voor het nalezen
ii
van allerhande schrijfsels, het ontwikkelen van software-drivers en re-
gressietesten, het uitvoeren van systeemintegratie en vermogensmetin-
gen, het aan de praat krijgen van nieuwe hardware-platformen, het
animeren van review- en andere vergaderingen, het uitwisselen van
ideee¨n en het brainstormen over uiteenlopende onderwerpen, het door-
geven van scriptjes, artikels, templates, . . . , en bovenal de aangename
werksfeer. Hendrik en Philippe wens ik een succesvolle toekomst met
Sigasi. Ook aan alle anderen aangename en vruchtbare werk- en on-
derzoekservaringen toegewenst.
Ronny, Michiel en Wim verdienen een bedankje voor het onder-
houden van de hardware-, software- en netwerkinfrastructuur. Jeroen,
Marnix, Rita en Wouter voor de hulp bij administratieve en logistieke
beslommeringen en mijn speurtochten door de bibliotheek.
De leden van de OAP-vergadering en het organiserend comite´ van
het doctoraatssymposium lieten me toe mijn horizon te verbreden bui-
ten het technisch-wetenschappelijke domein. Dankzij het GUK (Gents
Universitair Koor) waren de voorbije jaren aan de universiteit ook een
culturele verrijking. Onnodig te vermelden dat dit ook een uitbreiding
van de vriendenkring met zich meebracht.
Vrienden en familie wil ik bedanken voor hun interesse en steun
op alle vlakken. Mijn ouders gaven me de kans zonder te veel andere
bekommernissen aan dit doctoraat te werken.
Ten slotte wil ik Krista bedanken omdat ik alles met haar mocht en
mag delen.
Harald Devos
Gent, 11 februari 2008
Examencommissie
prof. Ronny Verhoeven, voorzitter
Onderwijsdirecteur Faculteit Ingenieurswetenschappen
Universiteit Gent
prof. Erik D’Hollander, secretaris
Vakgroep Elektronica en Informatiesystemen (ELIS)
Faculteit Ingenieurswetenschappen
Universiteit Gent
prof. Albert Cohen
ALCHEMY Group, INRIA Saclay Iˆle de France
Parc Club Orsay Universite´
France
prof. Leo Storme
Vakgroep Zuivere Wiskunde en Computeralgebra
Faculteit Wetenschappen
Universiteit Gent
prof. Dirk Stroobandt, promotor
Vakgroep Elektronica en Informatiesystemen (ELIS)
Faculteit Ingenieurswetenschappen
Universiteit Gent
prof. Jan Van Campenhout, promotor
Vakgroep Elektronica en Informatiesystemen (ELIS)
Faculteit Ingenieurswetenschappen
Universiteit Gent
prof. Ingrid Verbauwhede
Departement Elektrotechniek (ESAT)
Faculteit Ingenieurswetenschappen
Katholieke Universiteit Leuven
iv
Samenvatting
FPGA’s De FPGA (Field Programmable Gate Array, in het veld pro-
grammeerbare poortenmatrix) is de laatste twintig jaar uitgegroeid tot
een standaardcomponent in het ontwerp van digitale schakelingen.
Doordat FPGA’s herconfigureerbaar zijn is het mogelijk een ontwerp
nog na de productie te wijzigen en kan de hoge kost voor de fabricage
van maskers vermeden worden. Dit verkort ook de ontwerp- en pro-
ductiecyclus. De vaste prijs per chip maakt een implementatie op een
FPGA goedkoper dan in een ASIC (Application Specific Integrated Cir-
cuit, toepassingsspecifieke geı¨ntegreerde schakeling) voor lage en mid-
delgrote productievolumes. Voor de herconfigureerbaarheid moet ech-
ter een prijs betaald worden: de grote hoeveelheid programmeerbare
interconnecties en logica brengt een lagere kloksnelheid en een minder
efficie¨nt gebruik van de beschikbare chipoppervlakte met zich mee.
Het geheugenknelpunt De hoeveelheid geheugen op een FPGA vol-
staat in vele gevallen niet om te voldoen aan de behoeften van mod-
erne applicaties. De meeste multimediatoepassingen zijn voorbeelden
van dergelijke data-intensieve toepassingen. Een extern geheugen is
nodig om ze te kunnen implementeren. De toegangen naar een extern
geheugen verlopen echter trager dan de transacties binnen de FPGA,
zowel op het gebied van latentie als van bandbreedte. Hierdoor dreigt
dit externe geheugen een knelpunt te worden van het ontwerp: als
tijd verloren gaat door het wachten op datatransfers van en naar dit
geheugen kan de grote rekenkracht van een FPGA niet benut worden.
Een oplossing om de datatrafiek tussen de FPGA en het externe
geheugen te verminderen is het gebruik van buffers op de FPGA. Ge-
gevens die meermaals gebruikt worden hoeven dan slechts eenmaal
gekopieerd te worden naar de buffer en kunnen daarna uit dit buffer-
geheugen gelezen worden. Op elk moment kan slechts een beperkte
hoeveelheid data in de buffer aanwezig zijn. Daarom moet getracht
vi SAMENVATTING
worden alle bewerkingen op een bepaald data-element te groeperen in
de tijd, het zogenaamde verbeteren van de temporele lokaliteit. Hier-
voor kunnen lustransformaties gebruikt worden die de volgorde van
de berekeningen veranderen. Lustransformaties worden veelvuldig
gebruikt voor de optimalisatie van software, meer bepaald het cache-
gedrag. In hardware-ontwerpomgevingen, bijvoorbeeld hoog-niveau-
synthesetools, beperkt het toepassen van lustransformaties zich voor-
namelijk tot het verhogen van de hoeveelheid parallellisme. Voor loka-
liteit is er tot nog toe (te) weinig aandacht.
Dit proefschrift Dit proefschrift behandelt lustransformaties voor het
verbeteren van de datalokaliteit en het integreren ervan in het hard-
ware-ontwerpproces. Drie aspecten worden hierbij nader onderzocht:
(1) het samenstellen van lange reeksen lustransformatiestappen, (2) het
zoeken naar bovengrenzen op (quasi-)veeltermfuncties over discrete
domeinen voor het statisch evalueren van implementatie-eigenschap-
pen, en ten slotte (3) het genereren van hardware-beschrijvingen uit de
polyedrische voorstelling die in dit werk gebruikt wordt om lustrans-
formaties te beschrijven en toe te passen.
Het polyedrisch model Het iteratiedomein van een programmastate-
ment is de verzameling van alle waarden die de iteratievector, de vec-
tor samengesteld uit de iteratoren van de lussen omheen een state-
ment, aanneemt tijdens de uitvoering. Wanneer we ons beperken tot
lusnesten waarin elke lusgrens een lineaire combinatie is van de itera-
toren van de omhullende lussen en van gehele parameters dan komt
elk iteratiedomein overeen met een Z-polyeder. Dit is een verzameling
bestaande uit de geheeltallige punten in een polyeder of meer alge-
meen de doorsnede van een geheeltallig rooster met een polyeder. De
programmavoorstelling die hierop gebaseerd is heet het polyedrisch
(of geometrisch) model.
Sequenties van lustransformaties In een polyedrische voorstelling
worden lustransformaties beschreven als bewerkingen op matrices en
vectoren. Dit maakt het samenstellen van transformaties eenvoudiger
dan wanneer een tekstuele of abstracte-syntax-boomvoorstelling ge-
bruikt wordt. In dit proefschrift wordt onderzocht wat de invloed van
de volgorde van transformatiestappen is en hoe korte aaneenschake-
lingen van transformatiestappen samengesteld kunnen worden tot lan-
gere reeksen.
SAMENVATTING vii
Afbakenen van het bereik van quasi-veeltermen Om de keuze van
lustransformaties te sturen is het nodig om eigenschappen te kunnen
evalueren die het resultaat zijn van deze transformaties. Het schat-
ten of berekenen van dergelijke eigenschappen zonder het uitvoeren
van een implementatie is mogelijk door statische programma-analyse.
Voor de programma’s die kunnen voorgesteld worden in het polye-
drisch model, kunnen veel analyseproblemen herleid worden tot het
tellen van het aantal geheeltallige punten in een polyeder of het tellen
van het aantal geheeltallige punten in verzamelingen beschreven door
zogenaamde Presburger-formules. Dit zijn stelsels opgebouwd uit li-
neaire ongelijkheden samengesteld met logische operatoren en kwan-
toren. De oplossing van een dergelijk telprobleem kan beschreven wor-
den als een (stuksgewijze) Ehrhart quasi-veelterm in functie van de pa-
rameters van het telprobleem. In sommige analyseproblemen is het
nodig een bovengrens te vinden voor de oplossing van een telprobleem
over een bepaald discreet domein van parameterwaarden (de variabe-
len in de quasi-veelterm). Als bijvoorbeeld de hoeveelheid levende
data in een programma uitgedrukt wordt in functie van het tijdstip in
de uitvoering (de iteratorwaarden), dan is de minimale vereiste hoe-
veelheid geheugen om een uitvoering mogelijk te maken het maximum
hiervan over alle mogelijke uitvoeringstijdstippen.
In dit werk worden verschillende manieren voorgesteld om gren-
zen op het bereik van quasi-veeltermfuncties te bepalen. Eerst wordt
aangetoond dat de extrema van veeltermfuncties over een continu do-
mein een goede benadering zijn voor de extrema over de geheeltallige
punten binnen dit domein, op voorwaarde dat het domein voldoende
groot is en de graad van de veelterm voldoende laag. Partie¨le evaluatie
voor een deel van de veranderlijken kan gebruikt worden om de nauw-
keurigheid van deze benadering te verhogen. Door het toevoegen van
extra veranderlijken kunnen quasi-veeltermen omgezet worden naar
veeltermen zodat deze benaderingsmethode ook hier bruikbaar wordt.
Voor kleine domeinen is de exacte methode, namelijk het evalueren
van de quasi-veelterm in alle discrete punten van het domein, sneller
dan de voorgesteldemethoden die een continue benadering gebruiken.
Daarom werden hybride methoden onderzocht die voor kleine domei-
nen de exacte methode gebruiken en voor grotere domeinen een van
de continue benaderingen hierboven beschreven. Dit blijkt een goede
combinatie van de verschillende methoden te zijn (op het gebied van
nauwkeurigheid en rekentijd) met slechts een kleine kost voor de se-
lectie van de methode voor elk domein.
viii SAMENVATTING
Hardware-generatie vanuit een polyedrische voorstelling Lustrans-
formaties beı¨nvloeden niet alleen de datatransfers maar hebben ook
een invloed op de controlecomplexiteit van een implementatie. Dit
manifesteert zich meestal slechts na het verfijnen van een ontwerp tot
op een synthetiseerbaar niveau, hetgeen het verkennen van de ontwer-
pruimte van mogelijke lustransformaties bemoeilijkt. Daarom is het
nuttig lustransformaties te integreren in hoog-niveau-synthesetools. In
dit proefschrift wordt een hardware-architectuur voorgesteld die toe-
laat om een hardwarebeschrijving te genereren vanuit een polyedrische
voorstelling. Verschillende afwegingen tussen oppervlakte en klok-
snelheid zijn mogelijk. De code generator, CLooGVHDL, werd getest
aan de hand van het genereren van varianten van een inverse discrete
wavelet-transformatie. De syntheseresultaten zijn beter dan die van de
commercie¨le hoog-niveau-synthesetool Impulse C en competitief met
die van de Celoxica Handel-C compiler.
Besluit In dit proefschrift worden verschillende technieken voorge-
steld die bruikbaar zijn voor het vermijden van geheugenknelpunten
in hardware-ontwerp. Het opstellen van lustransformatiesequenties
om de lokaliteit te verbeteren, het zoeken naar grenzen op het bereik
van quasi-veeltermen om het resultaat van dergelijke transformaties te
evalueren en ten slotte het genereren van hardware vanuit de polyedri-
sche voorstelling die voor deze transformaties gebruikt wordt.
Summary
FPGAs In the last two decades, the Field Programmable Gate Array
(FPGA) has become a standard component in digital design. Its re-
configurability allows to update a design after production and avoids
the cost of producing masks for wafer processing. As a result FPGAs
come at a fixed cost per chip, which makes FPGA implementations
cheaper than ASICs (Application Specific Integrated Circuits) for small
and moderate production volumes. The penalty paid for the reconfig-
urability is a lower clocking frequency and a lower chip density. Still,
FPGAs can outperform instruction set processors when the available
parallelism can be exploited.
The memory bottleneck Multimedia applications have emerged on
all kind of devices, from desktop computers up to PDAs and cellphones.
These applications are not only computation-intensive but also data-
intensive, which means a large amount of memory is needed. To imple-
ment data-intensive applications on FPGAs off-chip memory is needed.
The accesses to this memory are slower (bandwidth and latency) than
accesses to on-chip memory and are a potential bottleneck. The high
computational power of a FPGA can not be exploited if a design suf-
fers a memory bottleneck.
A memory hierarchy should be constructed to decrease the number
of off-chip transactions. If a data element is copied to an on-chip buffer
this data element can be used several times with only one access to the
external memory. The buffers have a limited size and cannot contain
a copy of all data at the same time. Therefore, the different accesses
to a data element should be close together in time, i.e. exhibit a good
temporal locality. Loop transformations are a means to improve the
data locality by changing the execution order of computations and data
accesses. This technique is commonly used for software optimizations,
in particular optimization of the cache behavior. Current high-level
x SUMMARY
synthesis environments for hardware design lack support to implement
data-intensive applications on heterogeneous memory systems. They
focus rather on parallelism than on locality.
This thesis This thesis addresses the memory hierarchy problem to
high-level transformations of loop structures and the integration of
these transformations in the hardware design flow. This work focuses
on three aspects: (1) the composition of long transformation sequences,
(2) the search for bounds on polynomials and quasi-polynomials over
discrete domains for static program analysis and finally (3) the gener-
ation of hardware descriptions from a polyhedral representation, used
to describe and apply loop transformations.
The polyhedral model We restrict ourselves to loop nests in which
all loop bounds are linear expressions in some integer parameters and
the iterators of surrounding loops. As a result, the iteration domain of
a program statement, i.e. the set of all possible values of the iteration
vector (the vector composed of the iterators of all surrounding loops),
can be described as a Z-polytope, i.e. the set of integer points, or more
generally the set of points in an integer lattice, that lie inside a rational
polytope. The program representation based on this property is called
the polyhedral model.
Sequences of loop transformations In a polyhedral representation
loop transformations are described as operations on matrices and vec-
tors. This eases the composition of transformation sequences in com-
parison with a textual or abstract syntax tree representation. In this
thesis the influence of the order of transformation steps is studied to-
gether with the possibility to build long transformation sequences by
combining short sequences.
Bounds on quasi-polynomials Static program analysis involves the
estimation or computation of program properties without execution.
This can be used to guide the choice of loop transformations to apply.
For programs that can be represented in the polyhedral model, many
analysis problems can be reduced to counting the number of integer
points in a polytope or the number of integer solutions to so-called
Presburger formulas, i.e. systems of inequalities connected with logical
operators and quantifiers. The solution of such a counting problem can
SUMMARY xi
be expressed as a (piecewise) Ehrhart quasi-polynomial that is a func-
tion of the parameters of the problem. In some cases an upper bound on
a quasi-polynomial over a discrete domain is needed. If, for example,
the number of live data elements is expressed as a function of the point
of execution of a program, then the minimal memory requirement of
that program is the maximum over all points of execution.
This thesis presents several methods to find bounds on the range
of quasi-polynomials. First, it is proved that the extrema of a polyno-
mial over a continuous domain are a good approximation of the ex-
trema over the integer points in that domain, provided that the degree
of the polynomial is sufficiently low and the size of the domain is suffi-
ciently large. Partial evaluation for a selection of the variables allows to
improve the accuracy of this approximation. By introducing new vari-
ables quasi-polynomials can be transformed into polynomials to make
these approximations useful to find bounds on quasi-polynomials. For
small domains the exact method that evaluates the quasi-polynomial in
all discrete points of the domain is faster than the presented methods.
Therefore, hybrid methods have been constructed that apply the exact
method on small domains and apply one of the presentedmethods that
use continuous domain approximations on larger domains. These hy-
brid methods appear to combine the strengths of the different methods
with only a small overhead introduced by the selectionmechanism that
selects the method to apply in a domain.
Hardware generation from the polyhedral model Loop transforma-
tions not only influence the data transfers but also the control complex-
ity of an implementation. The impact on the hardware performance can
typically only be quantified after refinement to a synthesizable level.
This hinders an exploration of the space of loop transformations. There-
fore, it would be beneficial to integrate loop transformations in high-
level synthesis tools. A hardware architecture is presented with close
correspondence to the polyhedral representation, which allows to gen-
erate hardware starting from a polyhedral representation. Parallelism
is supported by the architecture but not implemented yet in the current
code generator, called CLooGVHDL. Different trade-offs between area
and clock speed are investigated. CLooGVHDL has been tested by gen-
erating variants of an inverse discrete wavelet transform. The results
outperform those of the commercial high-level synthesis tool Impulse
C and are competitive to those of the Celoxica Handel-C compiler.
xii SUMMARY
Conclusion This thesis presents several techniques that are useful in
hardware design to avoid a memory bottleneck. The composition of
loop transformation sequences to improve the data locality, the search
for bounds on quasi-polynomials to evaluate the result of such trans-
formations and finally, the generation of hardware from a polyhedral
representation used to apply the loop transformations.
Contents
Dankwoord, Aknowledgements i
Examencommissie, Examination commission iii
Nederlandstalige samenvatting v
English summary ix
Contents xiii
List of abbreviations xix
List of notations xxiii
1 Introduction 1
1.1 Reconfigurable hardware . . . . . . . . . . . . . . . . . . 1
1.1.1 The benefit of reconfigurable hardware . . . . . . 1
1.1.2 Field programmable gate arrays . . . . . . . . . . 3
1.1.3 Design flow . . . . . . . . . . . . . . . . . . . . . . 4
1.2 The need for loop transformations . . . . . . . . . . . . . 7
1.2.1 The memory bottleneck . . . . . . . . . . . . . . . 7
1.2.2 Loop transformations . . . . . . . . . . . . . . . . 9
1.3 Contributions of this work . . . . . . . . . . . . . . . . . . 12
1.3.1 Loop transformations . . . . . . . . . . . . . . . . 12
1.3.2 Bounds on quasi-polynomials over discrete do-
mains . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.3 Hardware generation . . . . . . . . . . . . . . . . . 13
1.4 Structure of this thesis . . . . . . . . . . . . . . . . . . . . 13
2 High-level synthesis and C-based design 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 15
xiv CONTENTS
2.2 Overview of high-level synthesis environments . . . . . . 16
2.2.1 State-based description . . . . . . . . . . . . . . . 16
2.2.2 Process networks . . . . . . . . . . . . . . . . . . . 17
2.2.3 Arrays of processing elements . . . . . . . . . . . 18
2.2.4 C-based high-level synthesis . . . . . . . . . . . . 19
2.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Loop transformations in the polyhedral model 23
3.1 The need for loop transformations . . . . . . . . . . . . . 23
3.1.1 Data locality . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 Parallelism . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Reuse distance . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Reuse distance analysis to guide loop transforma-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Vectors, polyhedra and polytopes . . . . . . . . . . . . . . 31
3.3.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Non-parameterized polyhedra and polytopes . . 32
3.3.3 Parameterized polyhedra and polytopes . . . . . 36
3.4 Program representation in the polyhedral model . . . . . 38
3.4.1 Statement and iteration domain . . . . . . . . . . 38
3.4.2 Schedule vector . . . . . . . . . . . . . . . . . . . . 39
3.4.3 Dependences . . . . . . . . . . . . . . . . . . . . . 41
3.5 Loop transformations . . . . . . . . . . . . . . . . . . . . . 45
3.5.1 Principle and example . . . . . . . . . . . . . . . . 45
3.5.2 Available infrastructure . . . . . . . . . . . . . . . 49
3.6 Composing sequences of loop transformations . . . . . . 54
3.6.1 Introductory example . . . . . . . . . . . . . . . . 54
3.6.2 Duplication of statement invocations . . . . . . . 57
3.6.3 Commutativity of elementary loop transforma-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6.4 A canonical order . . . . . . . . . . . . . . . . . . . 61
3.7 Useful transformation sequences for the 2-D IDWT . . . 61
3.7.1 From row-column-based to line-based (RC2LB) . 63
3.7.2 Vertical tiling (Tile V) . . . . . . . . . . . . . . . . . 64
3.7.3 Horizontal tiling (Tile H) . . . . . . . . . . . . . . 65
3.7.4 Overlap . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Building sequences from subsequences . . . . . . . . . . 66
3.9 Comparison with alternative representations . . . . . . . 70
3.9.1 Alternative polyhedral models . . . . . . . . . . . 70
CONTENTS xv
3.9.2 Transformations on ASTs . . . . . . . . . . . . . . 70
3.9.3 Advantages of the polyhedral model . . . . . . . . 71
3.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4 Bounds on quasi-polynomials for static program analysis 73
4.1 An example of static program analysis . . . . . . . . . . . 73
4.2 Quasi-polynomials and polyhedral counting problems . 76
4.3 Overview and methodology . . . . . . . . . . . . . . . . . 78
4.4 Bounds on polynomials in continuous domains . . . . . . 79
4.5 Cont.- vs. discr.-domain extrema of polynomials . . . . . 81
4.5.1 Univariate case . . . . . . . . . . . . . . . . . . . . 81
4.5.2 Multivariate case . . . . . . . . . . . . . . . . . . . 90
4.5.3 Optimization strategy . . . . . . . . . . . . . . . . 91
4.5.4 Comparison with Rivlin’s method . . . . . . . . . 93
4.6 Converting quasi-polynomials into polynomials . . . . . 94
4.6.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.2 Overview of methods with examples . . . . . . . 96
4.6.3 Other methods . . . . . . . . . . . . . . . . . . . . 101
4.6.4 Implementation . . . . . . . . . . . . . . . . . . . . 102
4.7 Memory size estimation experiments . . . . . . . . . . . . 103
4.7.1 The 1-D FIR-filter test case . . . . . . . . . . . . . . 103
4.7.2 The matrix multiplication test case . . . . . . . . . 109
4.8 Hybrid methods . . . . . . . . . . . . . . . . . . . . . . . . 115
4.9 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.9.1 Memory size estimation . . . . . . . . . . . . . . . 116
4.9.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . 118
4.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5 Hardware generation from the polyhedral model 123
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.2 Hardware architecture . . . . . . . . . . . . . . . . . . . . 125
5.2.1 Loop control architecture . . . . . . . . . . . . . . 126
5.2.2 Statement control and data path . . . . . . . . . . 129
5.3 Implementation details . . . . . . . . . . . . . . . . . . . . 129
5.3.1 Loop controller generation . . . . . . . . . . . . . 131
5.3.2 Statement control and data path generation . . . . 136
5.4 Simplification of loop bound expressions . . . . . . . . . 141
5.5 Optimizing the clock frequency . . . . . . . . . . . . . . . 144
5.5.1 Loop control . . . . . . . . . . . . . . . . . . . . . . 144
5.5.2 Data path . . . . . . . . . . . . . . . . . . . . . . . 148
xvi CONTENTS
5.6 Case study: exploration of IDWT variants . . . . . . . . . 148
5.6.1 Generating hardware variants . . . . . . . . . . . 150
5.6.2 Comparison of designs generated with CLooG-
VHDL, Impulse C and Handel-C . . . . . . . . . . 155
5.6.3 Building a memory hierarchy . . . . . . . . . . . . 156
5.6.4 Performance, power and energy analysis . . . . . 158
5.7 Discussion and conclusions . . . . . . . . . . . . . . . . . 159
6 Conclusions and future work 161
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2 Directions for future research . . . . . . . . . . . . . . . . 163
A The discrete wavelet transform and its inverse 165
A.1 The 1-D discrete wavelet transform . . . . . . . . . . . . . 165
A.1.1 Filters . . . . . . . . . . . . . . . . . . . . . . . . . 165
A.1.2 The convolution-based 1-D DWT . . . . . . . . . . 166
A.1.3 The lifting scheme . . . . . . . . . . . . . . . . . . 169
A.1.4 Computational complexity . . . . . . . . . . . . . 170
A.2 The 2-D discrete wavelet transform . . . . . . . . . . . . . 170
A.2.1 Computational complexity . . . . . . . . . . . . . 173
A.3 Manual design of a RC-wise IDWT . . . . . . . . . . . . . 174
A.3.1 Specifications . . . . . . . . . . . . . . . . . . . . . 174
A.3.2 Modular design concept . . . . . . . . . . . . . . . 174
A.3.3 Architecture . . . . . . . . . . . . . . . . . . . . . . 175
A.4 Image quality . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.4.1 PSNR . . . . . . . . . . . . . . . . . . . . . . . . . . 178
A.4.2 From floating-point to fixed-point . . . . . . . . . 178
A.4.3 Quality scalability . . . . . . . . . . . . . . . . . . 178
A.5 Related work on DWT variants . . . . . . . . . . . . . . . 178
B Reuse distances of IDWT variants 181
C Bernstein expansion 183
C.1 Parametric Bernstein expansion . . . . . . . . . . . . . . . 183
C.2 Incremental Bernstein expansion . . . . . . . . . . . . . . 186
D MRE of 3rd degree polynomials 191
D.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . 191
D.2 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
E Bounds on quasi-polynomials with the barvinok library 197
CONTENTS xvii
E.1 Command line interface . . . . . . . . . . . . . . . . . . . 197
E.2 Library call . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
F Example of code generated by CLooGVHDL 199
Index 211
List of publications 215
Bibliography 221
xviii CONTENTS
List of abbreviations
1-D 1-Dimensional
2-D 2-Dimensional
AD Arithmetic Decoder
ASAP As Soon As Possible
ASIC Application Specific Integrated Circuit
ASIP Application Specific Instruction set Processor
AST Abstract Syntax Tree
BRD Backward Reuse Distance
CIF Common Intermediate Format (288 × 352 pixels,
30000/1001 (≈ 29.97) frames/s, YCbCr 4:2:0)
CLooG Chunky Loop Generator
CPU Central Processing Unit
DDR Dual/Double Data Rate
DMA Direct Memory Access
DRAM Dynamic Random Access Memory
DSP Digital Signal Processing
DTSE Data Transfer and Storage Exploration
DWT Discrete Wavelet Transform
EDIF Electronic Design Interchange Format
FIFO First-In-First-Out
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
FSM Finite State Machine
gcd Greatest Common Divider
GOP Group Of Pictures
GPL GNU General Public License
GPP General Purpose Processor
GPU Graphics Processing Unit
xx LIST OF ABBREVIATIONS
HDL Hardware Description Language
HW HardWare
IC Integrated Circuit
IDWT Inverse Discrete Wavelet Transform
IF InterFace
iff If and only if
ILP Instruction Level Parallelism
ILP Integer Linear Programming
IMEC Interuniversitair Micro-Elektronica Centrum/ In-
teruniversity Microelectronics Center
I/O, IO Input Output
IOE Input Output Element
IP Intellectual Property
KPN Kahn Process Network
LAB Logic Array Block
LB Line-Based
LBT Line-Based and Tiled
LE Logic Element
LE Set of Living Elements
LHS Left-Hand Side
LRU Least Recently Used
LUT Look-Up Table
MAC Multiply ACcumulate
MRE Maximal Relative Error
NPA Non-Programmable Accelerator
ORC Open Research Compiler
PSNR Peak Signal to Noise Ratio
QCIF Quarter CIF
QoS Quality of Service
RAM Random Access Memory
RaR Read after Read
RaW Read after Write
RC Row-Column(-wise)
RE Relative Error
RHS Right-Hand Side
RTL Register Transfer Level
RMSE Root Mean Square Error
LIST OF ABBREVIATIONS xxi
SCoP Static Control Part
SDRAM Synchronous Dynamic Random Access Memory
SLO Suggestions for Locality Optimizations
SNR Signal to Noise Ratio
SoC System on Chip
SOPC System On Programmable Chip
SUIF Stanford University Intermediate Format
SW SoftWare
Th Threshold
TLM Transaction-Level Modeling
URUK Unified Representation Universal Kernel
VGA Video Graphics Array (computer display stan-
dard)
VHDL Very high speed integrated circuits Hardware De-
scription Language
VLIW Very Long Instruction Word (processor)
WaR Write after Read
WaW Write after Write
WED Wavelet Entropy Decoder
WHIRL Winning Hierarchical Intermediate Representa-
tion Language
WRaP-IT WHIRL Represented as Polyhedra - Interface Tool
xxii LIST OF ABBREVIATIONS
List of notations
N0, Z0 N \ {0}, Z \ {0}
aZ The set of multiples of a:
aZ = {m ∈ Z | m = aq for some q ∈ Z}
⌊·⌋ Integer part: ⌊x⌋ = y ⇔ y ∈ Z ∧ x− 1 < y ≤ x
{·} Fractional part: {x} = x− ⌊x⌋
| · | Absolute value
|S| Number of elements of the set S (cardinality)
(cond ? a : b) (cond ? a : b) =
{
a if cond,
b otherwise.
% mod, modulo operator
a ≡ b (mod d) a mod d = b mod d
δij Kronecker delta function: δij =
{
1 if i = j ,
0 if i 6= j .
S1 Statement S1
DS1 Depth of statement S1
DS1 Iteration domain of S1
IS1 Iteration vector of S1
DS1δS2 Dependence Domain of dependences from state-
ment S1 to S2
LEA(t) Set of Live Elements of array A at time t
θS1 Schedule function of statement S1
F b, Gb, . . . Upper bound on f, g, . . . using Bernstein expan-
sion
F b, Gb, . . . Lower bound on f, g, . . . using Bernstein expan-
sion
F c, Gc, . . . Continuous-domain maximum of f, g, . . .
F c, Gc, . . . Continuous-domain minimum of f, g, . . .
F d, Gd, . . . Discrete-domain maximum of f, g, . . .
F d, Gd, . . . Discrete-domain minimum of f, g, . . .
xxiv LIST OF NOTATIONS
RE Relative error of continuous- versus discrete-
domain range of a polynomial
REx RE after partial evaluation with x kept as free
variable
RE rivl Maximal RE of a polynomial computed with
Rivlin’s method
MRE Maximal possible value ofRE for any polynomial
of a given degree for a given domain size
SN SN = {S ⊂ {0, 1, . . . ,N}, |S| = g + 1}
LS,i(x) Lagrange interpolant, defined in Equation (4.11) :
LS,i(x) =
∏
0≤j≤g,j 6=i
x−xj
xi−xj
,
with i ∈ {0, 1, . . . , g}, S = {x0, . . . , xg} ∈ SN
MN (x) MN (x) = minS∈SN
∑g
i=0 |LS,i(x)|
K,KN , K(g,N) K(g,N) = maxx∈[0,N ] MN (x)
Chapter 1
Introduction
The Field Programmable Gate Array (FPGA) has become a standard com-
ponent in digital design. Its reconfigurability allows to update a design after
production and avoids the cost of producing masks for wafer processing. This
makes FPGA implementations cheaper than ASICs (Application Specific In-
tegrated Circuits) for small and moderate production volumes. The penalty
paid for the reconfigurability is a lower clocking frequency and a lower chip
density. Still, FPGAs can outperform processors when the massive available
parallelism can be exploited.
Multimedia applications have emerged on all kind of devices, from desk-
top computers up to PDAs and cellphones. These applications are not only
computation-intensive but also data-intensive, which means a large amount
of memory is needed. To implement data-intensive applications on FPGAs off-
chip memory is needed. The accesses to this memory are a potential bottleneck.
To avoid a memory bottleneck, a memory hierarchy should be constructed
and loop transformations are needed to improve the data locality. Current
high-level synthesis environments lack support to implement data-intensive
applications on heterogeneous-memory systems. They rather focus on paral-
lelism.
1.1 Reconfigurable hardware
1.1.1 The benefit of reconfigurable hardware
General Purpose Processors (GPP) and their low-end variants, micro-
controllers, offer a standard platform for implementing an algorithm.
Thanks to mass production they are relatively cheap and they can eas-
2 Introduction
ASIC FPGA GPPVLIWASIP
performance flexibility
Figure 1.1: Trade-off between flexibility and performance.
ily be (re-)programmed by compiling (high-level) software using an
appropriate (standard) compiler. Only when the desired performance
cannot be reached (speed, power, . . . ) another platform will be chosen.
Very Long Instruction Word processors (VLIW), Digital Signal Proces-
sors (DSP) and Application Specific Instruction set Processors (ASIP),
e.g., Graphics Processing Units (GPU), are more specialized towards a
specific application (domain), and can thus offer a higher performance
(within the application domain) at the cost of a lower performance for
code that does not match the specialized architecture. Also the writing
of software becomes more complex.
Application Specific Integrated Circuits (ASICs) give the designer
total freedom on how to use the available silicon real estate. The hard-
ware architecture can now be built targeted to the application, which
may lead to the highest possible performance. However, the cost of
ASIC design and manufacturing drastically increases as the technology
evolves to lower feature sizes. Only for very large production volumes
the cost of producing masks is affordable. The long production time
between the tape-out of a design and the delivery of test chips makes
iterating over a design slow and expensive. Next to the large produc-
tion cost and time, there is a high non-recurring engineering cost. The
low-level design is very labor intensive and thus also increases the cost
of iterating over a design.
Reconfigurable hardware offers a compromise between the flexibil-
ity of processors and the performance of ASICs (Figure 1.1). Reconfig-
urable devices are (mass produced) standard components, which have
a fixed price per chip. They consist of an array of elementary hardware
blocks connected by a network of programmable interconnections. All
hardware blocks can work in parallel and by this offer a huge potential
performance. Both the hardware blocks and the interconnections can
be reconfigured. When done after power-up this is called static recon-
figuration. Many systems use static reconfiguration for upgrading or
debugging a system. This possibility is a large advantage compared to
ASICs, which cannot be changed after production. Dynamic reconfigu-
ration means switching between configurations at run-time depending
1.1 Reconfigurable hardware 3
IOEs LABs LABs LABs LABs
LABs
LABs
IOEs IOEs IOEsIOEs IOEs
LABs
LABs
IOEs
RAM
IOEs LABs LABs LABs LABs
IOEs LABs LABs LABs LABs
IOEs LABs LABs LABs LABs
IOEs LABs LABs LABs LABs
IOEs LABs LABs LABs LABs
IOEs LABs LABs LABs LABs
IOEs LABs LABs LABs LABsRAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
RAM
DSP Block
DSP Block
M−RAM
Figure 1.2: Example FPGA structure, inspired by the Altera Stratix device [13].
IOE: Input Output Element, LAB: Logic Array Block, DSP Block: Digital Sig-
nal Processing Block.
on the tasks that have to be executed at a certain point in time, or to
time-multiplex the available hardware if the available size is not large
enough to contain the entire design. When talking about dynamic re-
configuration one can distinguish between full and partial reconfigura-
tion. In the latter case only part of the device is reconfigured .
1.1.2 Field programmable gate arrays
Field Programmable Gate Arrays (FPGAs) are the most popular exam-
ple of reconfigurable devices. Figure 1.2 shows the architecture of a
typical FPGA. The hardware blocks can be (re)configured to specify
their function, e.g., the logic function of a Look-Up Table (LUT), the
word and address size of memory blocks, the voltage levels of I/O
pins. Together with the configuration of the interconnections this al-
lows to implement any digital design (within size constraints) on a
FPGA. The device can be reconfigured without limit. This shortens
the production-testing-cycle, compared to an ASIC and allows to up-
grade a system after production. However, a cost has to be paid for
this flexibility. The reconfigurable interconnection network occupies a
large part of the chip area and introduces long delays. As a result, the
achievable clock frequency will be relatively low. The computational
density [64] and performance are lower than on an ASIC but can still
outperform those of processors [97, 36]. FPGAs can be more efficient
4 Introduction
than instruction set architectures when the algorithm has a high degree
of regularity and exploits high levels of parallelism.
FPGAs have evolved from very regular, fine-grain reconfigurable
devices [58], which were mainly used to contain the glue logic or parts
of a system, towards large heterogeneous devices with coarser blocks,
such as block RAMs, embedded multipliers and even processors. They
can contain an entire system on a chip (SoC) [1, 8]. Next to FPGAs,
other reconfigurable architectures have been developed [145, 132, 103].
Many of them are composed of an instruction set processor and a re-
configurable array, which serves as an accelerator for custom instruc-
tions [125, 50].
A disadvantage of the fine-grain reconfigurability of FPGAs is the
large area and delay cost of the interconnections, and the large reconfig-
uration time, typically some tens of milliseconds. Coarse-grain archi-
tectures restrict the reconfigurability, e.g., by fixing the word size of op-
erations, and can therefore run at higher frequencies and have a lower
reconfiguration time. This was also the reason for the introduction of
coarser blocks inside FPGAs. Kuon and Rose [114] have measured the
difference between FPGAs and ASICs in terms of logic density, circuit
speed and power consumption for core logic, i.e. without consider-
ing I/O, on a set of benchmark designs. They found that for a 90-nm
CMOS technology the ratio of silicon area required in FPGAs vs. ASICs
is on average 35 when only using LUTs and flip-flops. When using hard
blocks (multipliers and block memories) this average area gap drops to
18.
1.1.3 Design flow
Static configuration
In most of the systems, FPGAs are only configured at power-up, fol-
lowed by a system reset (static reconfiguration). In this case the design
trajectory is very similar to the higher levels of a VLSI design trajectory
(Figure 1.3). FPGA vendors offer tools that can generate a configuration
bitstream starting from Register Transfer Level (RTL) VHDL or Verilog.
These tools take care of the technology mapping (synthesis) and plac-
ing and routing of the structural elements on the FPGA. After placing
and routing, the number of hardware blocks used, the achievable clock
frequencies, critical paths, . . . are reported. The path from an algorithm
description or system specification to synthesizable RTL code is less
1.1 Reconfigurable hardware 5
Place & Route
RTL Synthesis
FPGA Configuration
Algorithm
Synthesizable HW Description
Figure 1.3: Simplified FPGA design flow.
System level
Algorithmic level
RTL
Logic level
Circuit level
(d)
(a)
StructureBehavior
Physical
(c)
(b)
Figure 1.4: Gajski and Kuhn’s Y-chart [87] positioning tool support for FPGA
design. (a) RTL synthesis. (b) Place & Route. (c) System builder. (d)High-level
synthesis.
automated and involves a lot of manual design effort. Not only has
behavior to be specified and refined, but also structure has to be taken
in mind. This differs a lot from writing software code and requires to
make choices in a huge design space as presented by [63].
The design flow can be visualized on a Y-chart as introduced by
Gajski and Kuhn in [87] (Figure 1.4). The concentric rings represent
levels of abstraction, from system level at the outer to circuit level at the
6 Introduction
inner. The three axes represent three different representations of a de-
sign: the behavioral or functional representation, the structural represen-
tation and the physical (geometrical) representation. The exact number
of levels and their names may vary between variants of this chart, de-
pending on the context.
Multiple tools, such as Altera SOPC builder [15] or Matlab Simulink
allow to easily build a system by connecting modules from a library or
user-mademodules that meet a certain interface. The connections spec-
ified between modules at a high level are automatically refined to RTL
(Figure 1.4(c)1). This alleviates the task of gluing a system together but
leaves the task of designing the modules and thinking up the system
architecture.
Recently, several high-level synthesis tools have emerged that claim
to raise the synthesizable level from RTL to algorithmic level (Fig-
ure 1.4(d)) [153]. In reality, they offer a trade-off between abstraction
level, i.e. design effort, and quality of the resulting hardware. In most
cases, the coding style of an algorithm has a large influence on the syn-
thesis results and the code has to be written with the target architecture
in mind.
Dynamic reconfiguration
Dynamic (partial) reconfiguration asks for an alternative design flow
[126, 59, 118] and will not be considered in this work. In the Gajski
Y-chart, dynamic reconfiguration can be represented by adding a recon-
figuration level below the algorithmic level [74], or by interpreting struc-
ture not only as structure in space, but also as structure in the combined
space-time.
1In this figure the (non-standard) term System Builder is used. In [86] both (c) and
(d) would be called high-level synthesis, the first in a bottom-up design methodology,
the second in a top-down methodology. Nowadays, the widely used term high-level
synthesis denotes the transfer from a high-level behavioral description to a low-level
structural description. This convention is followed in this work.
1.2 The need for loop transformations 7
Cache
Memory
Main
CPU
A B
processor chip
Figure 1.5: Simplified representation of the memory hierarchy in a processor-
based system (Von Neumann architecture).
1.2 The need for loop transformations
1.2.1 The memory bottleneck
In the last decade multimedia applications have emerged on many
places. Image and video processing are an example of applications
that not only demand a high computational power, but also are data-
intensive. They work on huge sets of data. Not all this data can be
stored on a processor chip or FPGA. Therefore, an external memory
has to be added (Main Memory in Figure 1.5 and 1.6).
The processing power of processors and FPGAs has increased expo-
nentially during the last decades, not only by increasing the clock fre-
quency but also by increasing the parallelism in the execution. The per-
formance of memory has also increased exponentially, however with
a smaller exponent. As a result, there is a widening processor-memory
gap. The processing speed is often not limited by the processing of
the data but rather by the time to fetch and store the needed data (a
so-called memory bottleneck). The memory inside a processor core (reg-
ister banks) or FPGA (registers, memory blocks) of course works at the
same speed as the computational units, but the used technology, SRAM
(Static RAM), is too expensive (area and power) to build the large main
memory of a system.
To bridge this gap a memory hierarchy is built by inserting on-chip
buffer memories between the processor and themain memory, e.g., one
or multiple caches. Figure 1.5 shows the simple case of a single cache.
The cache memory size is much larger than the register files in the pro-
cessor but much smaller than the main memory. The data transfers on
link B are slower (latency and bandwidth) than inside the CPU (Central
Processing Unit), but much faster than on link A to the main memory.
Thememory blocks inside a FPGA can also be used to construct amem-
ory hierarchy. In Figure 1.6 the general system architecture that will be
8 Introduction
Functional
BlockMem
Main
mem C
mem A
mem B
FPGA
Figure 1.6: General (FPGA) memory hierarchy used within this work. Several
of the memories might coincide or consist of multiple memories.
used in this work is defined. Each functional block has amemory to store
intermediate results (mem B), and communicates with the main mem-
ory and other blocks through buffers (mem A and mem C). Some of
these memories may consist of smaller memories that can be accessed
in parallel or might be omitted or combined with other memories. The
infrastructure that connects the blocks may be a bus, as suggested by
the figure, but can also consist of, e.g., a switch fabric [15] or point-to-
point connections.
To benefit from such a memory hierarchy the data in the buffers has
to be reused as much as possible to decrease the accesses to the exter-
nal memory. Therefore, the different accesses to a single data element
should be close together in time, i.e. exhibit a good temporal locality,
and consecutive accesses in time should access elements that are close
to each other in the memory, i.e. exhibit a good spatial locality. The lat-
ter allows to transfer data to and from the external memory in bursts,
which is more speed and power efficient.
A major point of difference between GPPs and FPGAs is that the
memory hierarchy on a FPGA is made application specific and the data
transfers are not controlled by a cache policy but are explicit, i.e. under
full control of the designer, similar to scratch pad memories [23, 111].
It is the task of the designer to build a memory hierarchy that is appro-
priate for the application and avoid a memory bottleneck by ensuring
that the data locality is sufficient.
Until now, high-level synthesis tools offer little support for the con-
struction of memory hierarchies.
1.2 The need for loop transformations 9
for (i=0;i<=N;i++){
for (j=0;j<=N-i;j++){
if (i==0)
B[j]=1; // S1(i,j)
B[i+j]=B[i+j]*A[i][j]; // S2(i,j)
}
}
for (k=0;k<=N-1;k++){
C[k]=B[k]+B[k+1]; // S3(k)
}
Figure 1.7: Small program example. The accesses to array B have a bad tem-
poral locality.
j
i
k
0
1
N
10 N
10 N
S1(i, j)
S2(i, j)
S3(k)
B[N − 1]
Figure 1.8: Polyhedral representation of the domains of the statements in Fig-
ure 1.7. The execution order is from left to right and from top to bottom. The
accesses to the element B[N − 1], indicated by dashed lines, illustrate the bad
temporal locality.
1.2.2 Loop transformations
Asmentioned in the previous section, the data locality has a large influ-
ence on the performance of a system. Loop transformations can serve
as a means to improve the data locality. They alter the execution order
of computation steps, which can, e.g., bring accesses to the same data
element closer together to improve the temporal locality. This will be il-
lustrated by the following example. The definition of some terms used
here is delayed until Chapter 3.
Consider the small loop nest containing three statements in Fig-
ure 1.7. A graphical (geometrical or polyhedral) representation of the it-
eration domains is shown in Figure 1.8. All the elements of arrayA and
C are only accessed once. The elements of array B are accessed mul-
10 Introduction
tiple times with a bad temporal locality, as illustrated for the element
B[N−1]. New iterators i′ and j′ are defined by the following mapping2
of statement invocations between the i, j, k space and the i′, j′ space
S1(0, j) 7→ S1(0, i′)
S2(i, j) 7→ S2(j′, i′ − j′)
S3(k) 7→ S3(i′ − 1) .
The mappings define the iteration domains in the i′, j′ space (by the
condition that the same statement invocations occur). The execution
order, now defined by (i′, j′), is altered as shown in Figure 1.9 and 1.10.
The temporal locality has improved. Within the innermost loop (itera-
tor j′) all statement invocations access the same element of array B. At
each moment at most two elements of B contain data that is needed
later on, i.e. are alive. As a result, the array can be reduced to two ele-
ments. This is done by the introduction of a modulo operation (mod 2)
in the index expressions.
This example shows how a loop transformation can improve the
temporal locality of a program. Also the spatial locality and exploitable
parallelism can be improved by loop transformations. Tools exist that
apply user specified loop transformations on a program [29]. Finding
loop transformations that optimize a certain property is much harder
to automate.
Often, naively applying transformations, such as loop fusion, is
prohibited since this would violate data dependences. In order to
make them valid, a sequence of enabling transformations is needed.
Typically, the data locality has to be made worse first, in order to
make subsequent optimizations possible. Because of this, optimizing
compilers often fail to automatically find the best sequence of trans-
formations [90]. Iterative compilation has been proposed to automate
the search for such sequences. The strategy is to iteratively compile
a program with different sequences of optimizations and to evalu-
ate the result by profiling the resulting binary. Based on the results,
new sequences of transformations are explored. Alternatively, Cohen
et al. [57] propose to use a processor simulator to analyze the main
performance bottlenecks in a manual iterative process.
The process of optimizing the locality of a program by loop trans-
formations can be split into three steps:
2How this mapping is found will be explained in Chapter 3.
1.2 The need for loop transformations 11
j′
i′
0
1
N
10 N
S1(0, i′)
S2(j′, i′ − j′)
S3(i′ − 1)
B[N − 1]
Figure 1.9: Domain after transformation. Execution order from left to right
and from top to bottom. The locality of arrayB has improved. Maximum two
elements are alive at the same time. The accesses to B[i′] and B[i′ − 1] can be
replaced with accesses to B[i′ mod 2] and B[(i′ − 1) mod 2].
for (i’=0;i’<=N;i’++){
B[i’]=1; // S1(0,i’)
for (j’=0;j’<=i’;j’++)
B[i’]=B[i’]*A[j’][i’-j’]; // S2(j’,i’-j’)
if (i’>=1)
C[i’-1]=B[i’-1]+B[i’]; // S3(i’-1)
}
Figure 1.10: Code after loop transformations, corresponding to Figure 1.9
• First, the parts of the code that cause bad locality have to be iden-
tified. This can be done by manual analysis (only feasible for
small programs) or assisted by tools. The techniques can be di-
vided into static code analysis and profiling of program runs.
• Once the bad code is identified, one has to seek loop transforma-
tions that improve the locality.
• The last step is the application of the loop transformations on the
code. For this stepmultiple tools and/or environments have been
developed, such as SUIF [6, 117, 100] and WRaP-IT/URUK [57,
90].
Iterating over these steps may be needed to get a satisfying result.
Heuristics are needed, as the search space of possible transformations
is too large to explore exhaustively. When designing hardware, some
of the impact of a transformation might only become visible after de-
scending to a synthesizable level. In this case iterating has a large cost,
which augments the need for design automation.
12 Introduction
To exploit the locality created by loop transformations further steps
are needed, such as data reuse exploration, memory data layout and
address optimizations [135, 131]. The DTSE methodology (Data Trans-
fer and Storage Exploration) [52], developed at IMEC, offers a frame-
work in which all these optimization steps have their place. Applying
loop transformations is one of the first steps in this methodology.
1.3 Contributions of this work
Current high-level design environments offer little support to imple-
ment data-intensive applications on heterogeneous-memory systems;
they rather focus on parallelism. This thesis addresses the memory hi-
erarchy problem to high-level transformations of loop structures.
This work presents contributions to three different aspects in the
flow from loop transformations to synthesizable hardware descrip-
tions.
1.3.1 Loop transformations
Girbal illustrates in [90, 57] that optimizations directed by the human
intelligence of a programmer can still outperform iterative compilation.
Therefore, tools have been developed that automate the application of
loop transformations but leave the decision of which transformations
to apply to the programmer. These are specified as a sequence of ele-
mentary transformation steps.
In Chapter 3, we propose to raise the abstraction level from combin-
ing elementary transformation steps to combining application-specific
primary (sub)sequences. By doing so, long transformation sequences
can be constructed in a shorter time and more promising transforma-
tion variants can be generated. In this context, also the influence of the
order in which transformation steps are applied is studied.
1.3.2 Bounds on quasi-polynomials over discrete domains
Bounds on (piecewise) quasi-polynomials over discrete domains are
needed to statically evaluate properties, such as memory usage, of a
program obtained after loop transformations. To this end, several novel
methods to find such bounds are presented in Chapter 4 and compared
1.4 Structure of this thesis 13
with existing methods. The presented methods take advantage of the
fact that for large domains or small degrees the continuous-domain ex-
trema of polynomials are a good approximation of the discrete domain
extrema. For small domains the straightforward method of evaluating
the (quasi-)polynomial in each point is still faster, but this solution does
not scale for larger domains. We have proved that with a simple selec-
tion mechanism, which introduces almost no overhead, hybrid meth-
ods can be constructed that combine the strengths of different methods.
With more complex selection mechanisms less additional benefit is ex-
pected.
1.3.3 Hardware generation
Since loop transformations not only influence the data access pattern
but also the control complexity, integration of loop transformations and
hardware generation is needed to speed up the design space explo-
ration. Therefore, an architecture is presented in Chapter 5 with close
correspondence to the polyhedral representation used to apply loop
transformations. This architecture offers several options for area-speed
trade-offs and supports parallelism and automation of VHDL code gen-
eration.
1.4 Structure of this thesis
The different chapters of this thesis are organized as follows. Chap-
ter 2 gives an overview of related work on high-level synthesis and
C-based hardware description languages, with special attention to the
way memory is dealt with. Chapter 3 introduces the concepts of poly-
hedra and polytopes and how a polyhedral model can be used to ap-
ply loop transformations. Thereafter, the construction of long transfor-
mation sequences is discussed. In Chapter 4, the differences between
the extrema of polynomials over discrete and continuous domains are
studied. The results are used to construct methods to find bounds on
polynomials and quasi-polynomials over discrete domains. The archi-
tecture that supports hardware generation starting from a polyhedral
representation is presented in Chapter 5. Finally, conclusions and di-
rections for future research are listed in Chapter 6.
Several appendices provide more information about implemen-
tation details, experimentation setup and results. For example, an
14 Introduction
overview of the (inverse) discrete wavelet transform and several im-
plementation aspects is found in Appendix A.
Chapter 2
High-level synthesis and
C-based design
In recent years a lot of effort has been spent on raising the level of abstraction
of hardware design. One approach is the development of tool suites that allow
to build a system by selecting and connecting hardware blocks, custom made or
drawn from a library of hardware cores. Another approach is the introduction
of new design languages, typically C-based, that allow high-level synthesis
or at least offer a smoother path from software to hardware or an appropriate
means for HW/SW-codesign.
This chapter gives an overview of some high-level synthesis environments and
investigates how they deal with memory related aspects.
2.1 Introduction
Everyonewho has ever gone through the design process of a digital de-
sign, from high-level specification until synthesizable RTL code, under-
stands the need for design automation. Manually refining high-level
code involves decisions about architecture and timing. The impact of
design choices is often only visible after low-level synthesis. This slows
down the design iteration process and increases the development cost.
In recent years many high-level synthesis tools have emerged, each
with their own application domain, input language, synthesis strategy,
target architecture, . . . . They attempt to raise the abstraction level of
hardware design, similar to the evolution in software from assembly to
high-level languages.
16 High-level synthesis and C-based design
Different approaches exist. Some tools offer little more than syntac-
tic sugar, which requires hardware design knowledge of the user but
gives much control over the implementation results. Other tools target
software engineers with little hardware design knowledge and try to
hide all implementation details from the developer. Some map appli-
cations on predefined architectures, while others need user directives
to make architectural choices. Logically, there is a trade-off between
the quality of synthesis results and the design effort and the class of
applications that can be described.
This chapter gives an overview of several methods, with special at-
tention to the way memory accesses are dealt with.
2.2 Overview of high-level synthesis environments
Several categories of high-level synthesis tools are distinguished de-
pending on the target architecture and shape of the input description.
2.2.1 State-based description
Bluespec [2, 45] uses an operation-centric hardware abstraction [105],
useful for describing systems with a high degree of concurrent (and
thus complex) control. The behavior of a system is described as a col-
lection of atomic operations on a set of state elements (terms). Each
operation is specified as a set of simultaneous (hence atomic) state-
element updates, and a predicate on the state values that captures the
conditions under which it can occur. The effect of the state updates of
an operation is atomic, i.e. the legal behaviors of the system consti-
tute some sequential interleaving of the operations. One can describe
each operation as if the rest of the system is frozen. However, the re-
sulting implementation, generated by the Bluespec tools, carries out
multiple operations per clock cycle if this does not conflict with the se-
mantics of the atomic and sequential execution [104, 18]. This can, e.g.,
be used when a memory is shared by different computation blocks. Po-
tential conflicts caused by concurrent accesses are made impossible by
the synthesis methods.
The use of states and transitions to formally describe a system is
very similar to Petri Nets [163]. That also Petri Nets can be used to
describe a specification and refine, transform and translate it towards
an implementation in hardware is demonstrated in [148].
2.2 Overview of high-level synthesis environments 17
These methods offer the designer a means to deal with the large
complexity of concurrent control operations. However, they do not al-
leviate the task of developing an architectural structure and defining
the states of a design. As such, these methods are good for control dom-
inated systems but do not aid in dealing with data-intensive problems.
They deal with parallelism, but data locality is not an issue at this low
level of abstraction.
2.2.2 Process networks
Kahn Process Networks (KPN) [110] are a popular means to describe
parallel systems. A process network consists of processes that com-
municate with each other via First-In-First-Out (FIFO) queues, called
channels. The processes represent computation entities than can read
data from the incoming channels and write data to the outgoing chan-
nels. A process is stalled if it tries to read from an empty channel or
write to a full channel (blocking read and write). The size of the FIFO
buffers should be large enough to avoid dead-lock [11]. Several vari-
ants and subsets of process networks have been defined, such as Com-
municating Sequential Processes (CSP), Synchronous Data Flow (SDF)
and Cyclo-Static Data Flow [44]. An overview of suchModels of Compu-
tation is given in [108] and [44].
The advantage of using process networks is that no global control
exists. Control signals exist inside a node and FIFO channels are the
only means of communication between nodes. As a result, all nodes
can work in parallel and synthesis techniques can optimize each node
independently.
Streaming-based applications perfectly fit this kind of networks.
Such applications typically have a good spatial and temporal locality.
However, if a process node does not process data in the order it is pro-
duced by another node, the FIFO buffers between those nodes have to
be extended with reorder memories [157], which may result in a large
memory cost.
The Compaan/Laura tool suite translates polyhedral loop nests into
KPNs, by eliminating global memory and global control [156]. Reorder
memories are automatically insertedwhere data is not consumed in the
order it is produced [157]. Laura translates the KPNs into VHDL [171].
The User-Guided High-level synthesis tool (UGH) [21, 76], takes C
code and a Draft Data Path (DDP) of nodes in a KPN as input. After
18 High-level synthesis and C-based design
coarse and fine grain scheduling, a data path and Finite State Machine
(FSM) is generated. A mapping step is (optionally) used to compute
propagation delays, which are used to optimize the clock speed. No
loop transformations nor memory optimizations are performed.
Impulse C [4] uses a subset of C extendedwith IO-macros to build a
network of Communicating Sequential Processes (CSP). For each pro-
cess that is selected to run in hardware, a data path and one large finite
automaton is constructed. Communication with the other processes,
which stay in software, is done through FIFOs. The states of a FSM
relate directly to the control flow graph of the Impulse C code of the
corresponding process. This code may be annotated with pragmas for
pipelining or loop-unrolling. An experimental evaluation of Impulse C
is found in Section 5.6.2.
To have satisfying results with such environments the application
should have data access patterns that fit the model of producers and
consumers linked by FIFOs. If not, reorder buffers may result in a large
overhead or it may be possible that the final KPN consists of a single
node, which eliminates the advantages of using such a model of com-
putation.
Loop transformations may be needed to transform an application
into a form that fits the KPN model. However, support for such trans-
formations is not offered yet by the mentioned design environments.
2.2.3 Arrays of processing elements
Several projects aim at mapping loop nests on arrays of processing el-
ements, often called systolic arrays, though they form an extension to
the concept of systolic arrays in the strict sense. The hardware genera-
tion process consists of generating functional units, which can execute
the loop bodies (described in a high-level language), placed in an array
with controllers, memory and I/O. The iterations are mapped in both
space (processing elements) and time.
In MMAlpha [93], loop nests are represented in a functional, dy-
namic single assignment language, (a system of recurrence equations).
The code is entirely mapped onto a systolic array. To hide the bus or
memory latency, the input to the array should be put in order in ad-
vance [65]. The output is done in the order samples are produced.
In [93] Guillou et al. take care of the way on-chip memory blocks can
be used for intermediate results. However, the case in which external
2.2 Overview of high-level synthesis environments 19
memories are needed is left as future work. How scheduling should be
done to avoid large intermediate data sets is not addressed either.
The PICO project [140](Program In Chip Out) maps loop nests on
systolic arrays which serve as Non-Programmable Accelerators (NPA)
[147] for a specialized VLIW or EPIC processor (Explicitly Parallel In-
struction Computing). The user may annotate an array with a pragma
directing PICO to keep that array in a local memory. Register promo-
tion is used for arrays with uniform dependences to eliminate load-
/store operations. To limit the hardware cost for registers caused by
this register promotion, the iteration space is tiled. This tiling reduces
the iteration space that is implemented by the NPA (inner loops of the
nest). The outer loops run on the host processor. The input that can be
mapped on NPAs is limited to perfect loop nests.
PARO [101] constructs an array of identical processing elements and
explores the possible space-time mappings [102]. A mapping specific
interconnect topology [33] is built, and the control is not centralized
but distributed over the elements [32]. In order to match resource con-
straints partitioning techniques are applied [155] to either sequentially
execute tiles (local parallel, global sequential) or to sequentially exe-
cute operations within a tile (local sequential, global parallel) or to use
an intermediate scheme. This is similar to the tiling done by PICO.
All these projects target systolic arrays, and thus are less appropri-
ate for building non regular designs [94]. For this class of regular appli-
cations, it is typically not difficult to find schedules with a good locality.
2.2.4 C-based high-level synthesis
The most general input for high-level synthesis is a high-level software
language, e.g., C. The following environments all have an input for-
mat that is based on a subset of C. In many cases special constructs are
added to guide the synthesis process.
SPARK [98, 99] takes a behavioral description in ANSI-C as input
and generates synthesizable VHDL. It performs compiler transfor-
mations, using Hierarchical Task Graphs (HTG), that overcome the
effect of programming style on the quality of generated circuits. This
includes speculative code motion transformations, dynamic renam-
ing and heuristic scheduling. The methods work well to implement
control-intensive functions of scalars, but it assumes parallel access to
all its input data and is therefore not suitable to work with arrays or
20 High-level synthesis and C-based design
memory accesses [166].
The Cameron Project has created a high-level algorithmic language,
named Single Assignment C (SA-C) [46], for expressing image process-
ing applications. It has special constructs for vector and sliding win-
dow operations (processing an input stream as a sequence of smaller
overlapping blocks) and is functional in nature. Hierarchical Data De-
pendence and Control Flow (DDCF) graphs are used as an intermediate
representation. Optimizations are implemented in the SA-C compiler,
which may be controlled by user pragmas in the source code, e.g., to
evaluate space-time trade-offs. Compilation to FPGAs is done by gen-
erating a structural VHDL description, using pre-existing parameter-
ized library cores, e.g. corresponding to the vector and window oper-
ations. The problem of dealing with data locality is shifted to the pro-
grammer. He or she has to take care of it by using vector and window
operations whenever possible. The compiler itself is not able to extract
these constructs from a general program.
A similar concept, the use of hardware skeletons, is promoted by
Benkrid [37], inspired by software skeletons, proved successful in par-
allel programming. This framework is based on a library of parame-
terized descriptions of task-specific architectures to which the user can
supply parameters such as values, functions or other skeletons. Exam-
ples are again vector and window operations. Improving the memory
behavior, e.g., by introducing line-buffers is done in a predefined way.
Again, the user has to detect where input constructs with a pre-defined
solution can be used.
ROCCC (Riverside Optimizing Configurable Computing Com-
piler), a successor of SA-C, also targets window operations, but is able
to extract them from the input code, written in C with strong restric-
tions (no pointers, perfect loop nests with constant bounds, affine index
functions, . . . ). The SUIF (Stanford University Intermediate Format)
compiler [6] serves as front-end. Scalar replacement is used to exploit
the data reuse within the loop nests [95] and after user directed loop
unrolling [49] and other optimizations VHDL is generated [96]. The
controllers and address generators are implemented using pre-existing
parameterized FSMs from a VHDL library. For window operations
a so-called smart buffer is generated, which stores live input data and
exports data in parallel to the computation blocks. The instantiated
smart buffer stores each data element in a fixed register. It does not
shift data elements from one register to another. This leads to a com-
2.2 Overview of high-level synthesis environments 21
plex structure of multiplexers as criticized by Dong et al. in [75]. They
demonstrate that the clock speed is increased by shifting the data ele-
ments instead of using multiplexers at the output of the smart buffer.
Furthermore, it is not clear how ROCCC deals with code that does not
fit the sliding-window model.
A similar compilation and synthesis system is DEFACTO [73]. It
combines parallelizing compiler techniques with behavioral synthesis
technology. Its strategy is to distribute the processed data overmultiple
memory banks according to the access pattern. An automatic design
space exploration is performed to find the optimal unroll factor of the
loop nests [151], where a balanced design, i.e. on the transition from
compute bound to memory bound, is targeted. When the reuse distances
are too large, loop tiling is applied to reduce the number of on-chip
registers, similar to PICO and PARO. A large restriction is that only
constant loop bounds, not even parameterized, are considered.
Handel-C [3] extends the C language to express hardware function-
ality, such as word lengths, explicit timing parameters and parallelism
and limits the language to exclude C features, such as random point-
ers, that do not lend themselves to hardware translation. The Handel-C
compiler can directly generate EDIF-netlists (Electronic Design Inter-
change Format) for a target FPGA instead of VHDL, which results in a
higher performance (area and clock frequency). Due to the direct map-
ping fromHandel-C constructs to hardware, the coding style has a large
influence on the resulting hardware. Writing Handel-C code should be
donewith the resulting hardware in mind. An experimental evaluation
of Handel-C is found in Section 5.6.2.
The C2H compiler of Altera generates coprocessors for the Nios
II processor (soft core), that implement a C function [14]. Each ele-
ment of the C syntax, including pointers, is translated to an equivalent
hardware structure using straightforward mapping rules [10]. ASAP
scheduling (As Soon As Possible) is used. The developer has to know
the C-to-hardware mappings, to control the hardware structure of an
accelerator, based on the structure of the C code [12]. Else a large area
overhead may be introduced.
Since Handel-C and C2H have a straight-forward mapping from
data structures to memories it is the full responsibility of the program-
mer to build memory structures, take care of the data access locality
and have a good mapping of data to the memories.
22 High-level synthesis and C-based design
2.3 Conclusions
Next to tools that increase a designer’s productivity by offering meth-
ods to easily build a system by connecting IP-cores and other modules,
a lot of high-level synthesis tools with corresponding languages have
emerged. They offer a trade-off between manual design effort and the
efficiency of a hardware implementation.
The high-level synthesis methods described in this chapter all focus
more on exploiting parallelism than on improving data-access patterns.
This may lead to bandwidth limited designs. The only loop transfor-
mations mentioned above are loop unrolling to increase the parallelism
and loop tiling to reduce the number of on-chip memory used. It is
the responsibility of the user of a high-level synthesis tool to study
the memory access pattern and its locality and improve it if needed.
For this optimization process many other loop transformations can be
used. The next chapter deals with data locality and ways to improve it
by using loop transformations.
Chapter 3
Loop transformations in the
polyhedral model
The memory access pattern of a system typically has a large influence on
its execution speed and energy dissipation. Therefore, loop transformations
are needed to improve the spatial or temporal data locality. This chapter first
describes how programs can be represented and transformed using a polyhedral
representation. Then, the composition of sequences of transformations is lifted
to a higher abstraction level. Application specific primary sequences, composed
of elementary loop transformations, can be constructed and combined to more
easily explore long transformation sequences.
3.1 The need for loop transformations
3.1.1 Data locality
The processor-memory gap
As mentioned in Section 1.2.1, the processing power of processors and
the performance of memory chips both increase exponentially, but not
with the same exponent. This causes a widening processor-memory
gap. The large main memory is built with a slower technology than
used for the small memories inside the CPU or FPGA. As a result, for
many applications the performance is limited by the accesses to the
main memory and not by the computational power.
24 Loop transformations in the polyhedral model
Cache and scratch pad memory
To bridge this gap a memory hierarchy is built by inserting one or mul-
tiple buffer memories, e.g., caches and scratch pad memories, between
the processor and the main memory. The buffer memories are much
larger than the register files in the processor but much smaller than the
main memory. The data transfers between processor and buffers are
slower (latency and bandwidth) than inside the CPU (Central Process-
ing Unit), but much faster than on the link to the main memory.
We will now have a closer look at an architecture composed of a
CPU, a cache and a main memory (Figure 1.5). If the CPU needs data
that is not present in the cache (a cache miss), that data will be copied
from themain memory togetherwith neighboring data elements, filling
a cache line. A next access to an element of this cache line (a cache hit),
does not need a transfer from the main memory and will be faster and
more energy efficient.
When the cache is filled with data, fetching new data is only pos-
sible when other data is removed1 from the cache. Which data is
removed depends on the cache policy. This policy is a compromise
between implementation cost and cache performance. The Least-
Recently-Used policy (LRU), for example, removes the data elements
with the longest time since their last access. The optimal policy is re-
moving the elements that will not be used anymore or will be used
the furthest away in the future [34]. This, however, requires knowl-
edge about the future and is not possible when executing an arbitrary
program.
For simplicity the concept of associativity, which limits the locations
in the cache where data elements can be stored based on the address,
is not discussed here. For a more detailed study of cache behavior the
reader is referred to [40].
Data transfers that are grouped in bursts (transfers with consecu-
tive addresses) are faster and consume less energy. In a cache-based
system the burst length corresponds to a cache line. When the data ac-
cess pattern of a program is known at compile time it may be beneficial
to replace the cache with a scratch pad memory [23]. In this case, there
is no cache replacement policy and all data transfers between the main
memory and the scratch pad memory are controlled by the running
1Simply removed if the data has not been changed, written back to the main mem-
ory if it has.
3.1 The need for loop transformations 25
program [109, 111]. For many applications less transfers and longer
burst lengths can be obtained with application controlled scratch pad
memories.
In FPGA-based systems buffer memories should be constructed us-
ing the on-chip memory blocks. In contrast with general purpose pro-
cessors no fixed memory hierarchy exists. The memory hierarchy on
dedicated hardware (FPGAs and ASICs) is custom-built and thus ap-
plication specific. The designer has to take care of all memory transfers.
Spatial and temporal locality
To efficiently use the memory hierarchy, accesses to the main memory
should be minimized by (re)using data that is present in the cache as
much as possible. This can be expressed as having a good data locality.
A qualitative definition of the concept of data locality has been given
in Section 1.2.2. A high temporal locality means that different accesses
to a same data element are close to each other in time, while a high
spatial locality means that accesses that are close to each other in time
access elements that are close to each other in the memory. The former
depends on the execution order of memory accesses while the latter
also depends on the way data elements are arranged (mapped) in the
memory. A quantitative measure for temporal locality is given by the
reuse distance, which will be defined in Section 3.2.
The execution order of memory accesses has a high impact on the
temporal and spatial locality. Therefore, loop transformations, which
alter this order, can be used to improve the locality. This has already
been demonstrated with the small example in Section 1.2.2. The ma-
jor subject of this chapter is the description of a polyhedral model that
allows to automate the application of loop transformations and the pre-
sentation of techniques that guide the search for good loop transforma-
tions.
It should be noted that loop transformations are only one step in a
series of steps to optimize the memory hierarchy of a system and its
usage. A survey of data and storage optimization techniques is found
in [135]. Examples of other steps are memory mapping, register alloca-
tion and address generation.
26 Loop transformations in the polyhedral model
Example of a memory limited system
A manual design of an Inverse Discrete Wavelet Transform (IDWT) is
described in Section A.3. According to the synthesis and simulation re-
sults the execution speed reaches almost 79 frames/s (design 26 and 27
in Table 5.9). However, when put into a real system with several mod-
ules (all clocked at 50 MHz), which share an external DDR-SDRAM
memory, the frame rate drops down to 11.3 frames/s due to the time
wasted whenwaiting for data. When other blocks also initiate memory
transactions the frame rate is further reduced. With this IDWT imple-
mentation the RESUME video decoder (Section A.3.1) runs at only 5.14
frames/s, which is far from real-time. That loop transformations can be
used to improve the locality will be shown in Section 3.7 and 3.8. That
these transformations also increase the performance of a hardware im-
plementation will be demonstrated in Chapter 5.
3.1.2 Parallelism
FPGAs can only outperform processors, which have a much higher
clocking frequency, if enough operations can be done in parallel. Loop
transformations can be used to raise the exploitable parallelism in an
algorithm, typically in a trade-off with the locality. However, this is not
the focus of this work and will not be discussed further.
3.2 Reuse distance
3.2.1 Definitions
A quantitative measure for temporal locality is given by the reuse dis-
tance. The following (software) terminology is used:
Definition 1. A memory reference corresponds to a read or a write in the
source code of a program, while a particular execution of that read or write at
run-time is called amemory access. A reuse pair is a pair of accesses to the
same data location without intervening accesses to that location. The use of a
reuse pair is the first access, the reuse is the second access. The reuse distance
of a reuse pair is the number of unique data locations accessed between use and
reuse.
Definition 2. The backward reuse distance (BRD) of a memory access is
the reuse distance of the reuse pair in which this access is the second access.
3.2 Reuse distance 27
for (i=0;i<=N;i++){
for (j=0;j<=N-i;j++){
B[i+j]=B[i+j]*A[i][j]; // S2(i,j)
}
}
for (k=0;k<=N-1;k++){
C[k]=B[k]+B[k+1]; // S3(k)
}
Figure 3.1: Simplified version of the code in Figure 1.7.
If the element is accessed for the first time no such pair exists and the BRD is
defined as∞.
In a (fully associative2) cache with the Least Recently Used (LRU)
replacement policy, data is retained between use and reuse if and only
if the corresponding reuse distance is smaller than the cache or buffer
size [42].
The concepts introduced above are illustrated on the code in Fig-
ure 3.1. This corresponds to the code in Figure 1.7, but simplified by
removing the initialization statement S1. Table 3.1 lists the memory
accesses for a program run with N = 3. It is assumed that only the
array references generate memory accesses. A (fully associative) cache
with LRU replacement policy and size four is used. The elements in the
cache are sorted depending on the time since the last access. If the BRD
is smaller than four, i.e. the cache size, this number gives the position
in the cache before the access, counting from 0 onwards. For exam-
ple, the second access to B[0] in iteration (i, j) = (0, 0) has a BRD of
1, which corresponds to the position of B[0] after the access to A[0][0].
A histogram of the reuse distances is shown in Figure 3.3(a). A vertical
line indicates the cache size and separates the accesses which result in a
cache hit (left side) from those which result in a cache miss (right side).
After the transformations described in Section 1.2.2 (and removing S1)
2This means amainmemoryword can be stored in any location in the cache without
restrictions. In a cache with an associativity equal to k, each mainmemory word can be
stored in one of k places (defined by the address) in the cache. The special case when
k = 1 is called a direct mapped cache. Direct mapped and set-associative caches give
rise to conflict misses. This means data has been removed from the cache although other
data in the cache is unused for a longer time, caused by the restrictions on the location
in the cache where a word can be stored. In this work we do not consider the concept
of cache associativity.
28 Loop transformations in the polyhedral model
for (i’=0;i’<=N;i’++){
for (j’=0;j’<=i’;j’++)
B[i’]=B[i’]*A[j’][i’-j’]; // S2(j’,i’-j’)
if (i’>=1)
C[i’-1]=B[i’-1]+B[i’]; // S3(i’-1)
}
Figure 3.2: Code after loop transformations, corresponding to Figure 1.10
without S1.
 0
 2
 4
 6
 8
 10
 12
 14
 0  2  4  6  8  10  12  14
Reuse Distance
on-chip off-chip
 0
 2
 4
 6
 8
 10
 12
 14
 0  2  4  6  8  10  12  14
Reuse Distance
on-chip off-chip
(a) (b)
Figure 3.3: Histogram of the reuse distances of the example in Section 3.2.1
before (a) and after (b) loop transformations, corresponding to the code in
Figure 3.1 and 3.2, respectively. N = 3.
(Figure 3.2) a lot of accesses have a smaller reuse distance as illustrated
by the histogram in Figure 3.3(b).
As demonstrated above, the reuse distance histogram allows to cal-
culate the number of data transfers to/from the main memory in a
cache-based memory hierarchy. Also in other environments the reuse
distance histogram can be used as a measure for the temporal locality.
It then serves as an indication of the potential to avoid data transfers to
an external memory by using on-chip memory.
3.2.2 Reuse distance analysis to guide loop transformations
The reuse histogram shows whether a program has a good or a bad
locality. However, more information can be extracted from reuse pairs
by also looking at the location of use and reuse in the program. This
way the accesses that cause poor locality can be localized.
Three kinds of reuse pairs can occur in a programwith nested loops:
3.2 Reuse distance 29
Table 3.1: Memory accesses of the execution of the code in Figure 3.1 for
N = 3, with corresponding backward reuse distance (BRD) and cache con-
tent (cache with size 4 and LRU replacement policy). The cache content is
sorted according to the BRD.
Iteration Access BRD Cache Content (after access)
0 1 2 3
(i, j) = (0, 0) B[0] ∞ B[0]
A[0][0] ∞ A[0][0] B[0]
B[0] 1 B[0] A[0][0]
(0, 1) B[1] ∞ B[1] B[0] A[0][0]
A[0][1] ∞ A[0][1] B[1] B[0] A[0][0]
B[1] 1 B[1] A[0][1] B[0] A[0][0]
(0, 2) B[2] ∞ B[2] B[1] A[0][1] B[0]
A[0][2] ∞ A[0][2] B[2] B[1] A[0][1]
B[2] 1 B[2] A[0][2] B[1] A[0][1]
(0, 3) B[3] ∞ B[3] B[2] A[0][2] B[1]
A[0][3] ∞ A[0][3] B[3] B[2] A[0][2]
B[3] 1 B[3] A[0][3] B[2] A[0][2]
(1, 0) B[1] 4 B[1] B[3] A[0][3] B[2]
A[1][0] ∞ A[1][0] B[1] B[3] A[0][3]
B[1] 1 B[1] A[1][0] B[3] A[0][3]
(1, 1) B[2] 4 B[2] B[1] A[1][0] B[3]
A[1][1] ∞ A[1][1] B[2] B[1] A[1][0]
B[2] 1 B[2] A[1][1] B[1] A[1][0]
(1, 2) B[3] 4 B[3] B[2] A[1][1] B[1]
A[1][2] ∞ A[1][2] B[3] B[2] A[1][1]
B[3] 1 B[3] A[1][2] B[2] A[1][1]
(2, 0) B[2] 2 B[2] B[3] A[1][2] A[1][1]
A[2][0] ∞ A[2][0] B[2] B[3] A[1][2]
B[2] 1 B[2] A[2][0] B[3] A[1][2]
(2, 1) B[3] 2 B[3] B[2] A[2][0] A[1][2]
A[2][1] ∞ A[2][1] B[3] B[2] A[2][0]
B[3] 1 B[3] A[2][1] B[2] A[2][0]
(3, 0) B[3] 0 B[3] A[2][1] B[2] A[2][0]
A[3][0] ∞ A[3][0] B[3] A[2][1] B[2]
B[3] 1 B[3] A[3][0] A[2][1] B[2]
(k) = (0) B[0] 12 B[0] B[3] A[3][0] A[2][1]
B[1] 8 B[1] B[0] B[3] A[3][0]
C[0] ∞ C[0] B[1] B[0] B[3]
(1) B[1] 1 B[1] C[0] B[0] B[3]
B[2] 6 B[2] B[1] C[0] B[0]
C[1] ∞ C[1] B[2] B[1] C[0]
(2) B[2] 1 B[2] C[1] B[1] C[0]
B[3] 5 B[3] B[2] C[1] B[1]
C[2] ∞ C[2] B[3] B[2] C[1]
30 Loop transformations in the polyhedral model
 0
 2
 4
 6
 8
 10
 12
 14
 0  2  4  6  8  10  12
Reuse Distance
on-chip off-chip
fuse i/k
tile i
not loop-carried
tile k
Figure 3.4: Reuse distance histogram before transformations (Equivalent to
Figure 3.3(a)). The textures indicate loop transformations that may improve
the locality of the corresponding class of reuses.
• The use and reuse occur across different iterations of a single loop. This
pattern arises when the loop traverses a “data structure” (as small
as a single scalar variable or as large as all the data in the pro-
gram) in every iteration of the loop. The distance of reuses across
iterations can be reduced by ensuring that only a small part of the
data structure is traversed in any given iteration. As such, reuses
of data elements between consecutive iterations are separated by
only a small amount of data, instead of the complete data struc-
ture. A large number of transformations have been proposed that
all aim at increasing temporal locality in this way, such as loop
tiling [164], loop interchange [124] and loop chunking [30]. These
transformations will be called tiling-like optimizations.
• The use and reuse occur between different loops. A data structure is
traversed by the first loop, after which it is retraversed by the sec-
ond loop. The reuses can be brought closer together by only doing
a single traversal, performing computations from both loops at
the same time. This kind of optimization is known as loop fusion.
The required transformation will be called a fusion-like optimiza-
tion.
• The use and reuse occur inside one iteration of a loop. The use and
reuse occur at a close distance, inside the same basic block. If the
reuse distance needs to be made shorter, some simple reordering
of code in that basic block needs to be performed. This is a non-
loop-carried reuse.
Techniques to classify the reuse pairs in a program run and to pro-
3.3 Vectors, polyhedra and polytopes 31
duce the corresponding code refactoring (loop transformation) hints
are described in [43, 41, 66] and implemented in the SLO tool (Sug-
gestions for Locality Optimizations) [39].
Figure 3.4 shows the same reuse histogram as Figure 3.3(a), but with
the classification of the reuse pairs. It indicates that fusing loop i with
loop k is required to optimize the reuse pairs with the longest reuse
distance. Next, a tiling-like transformation on loop i can reduce the
distance of reuses carried by loop i. This histogram only is an indication
of the origin of reuse pairs. It does not guarantee that the suggested
transformations are possible and, if they are, will not worsen the reuse
distance of other pairs. With these techniques a priority ranking can be
given to different potential optimizations. For example, optimizing the
reuses carried by loop k is useless as long as the other, longer reuses are
not optimized.
3.3 Vectors, polyhedra and polytopes
Until now loop transformations have been used in somemagicalway to
improve the data locality. Before loop transformations and, more par-
ticularly, their mathematical representation are studied in more detail,
some concepts have to be introduced. The definitions and notations are
based on [55, 119, 40, 159]. A more general study on polytopes is found
in [170]. Some definitions will not be used until Chapter 4, but are put
in this section to clarify the interrelationships.
3.3.1 Vectors
To simplify the notations, implicit usage will be made of the isomor-
phism between the set of n-dimensional vectors and the set of column
matrices with n rows over a given field, e.g., Qn and Qn×1. This allows
to multiply matrices with vectors and write n-tuples as column matri-
ces. No distinction will be made between expressions as (i, j) and [i j]T
or
[
i
j
]
.
Partial ordering vectors is defined element-wise:
(a0, a1, . . . , ak) ≥ (b0, b1, . . . , bk)⇔ ai ≥ bi,∀i ∈ {0, 1, . . . , k} . (3.1)
Definition 3. Vectors can be totally ordered using the lexicographical or-
32 Loop transformations in the polyhedral model
dering ‘≺’:
(a0, a1, . . .) ≺ (b0, b1, . . .)
m (3.2)
∃k ∈ N | ak < bk ∧ ∀i < k, ai = bi .
The definition for ‘’ follows from this definition:
a  b⇔ a ≺ b ∨ a = b .
Note that the two vectors do not need to have the same length to be compared.
They do of course to be equal.
Some examples are (0, 3) ≺ (1, 2) and (1, 0) ≺ (1, 1, 0). With this
definition the ordering of, e.g., (0, 1) and (0, 1, 2) is undefined. Such
cases will be prevented in this work, though the definition could be
extended to include them. As such, the lexicographical ordering results
in a total ordering.
3.3.2 Non-parameterized polyhedra and polytopes
An equality can always be written as two conjugate inequalities:
ax = bp+ c⇔ ax ≤ bp+ c ∧ ax ≥ bp+ c .
As a result, a systemof equalities and inequalities can always be written
as a system with inequalities only.
Definition 4. A convex rational polyhedron P is defined by a finite set of
linear inequalities:
P = {x ∈ Qn | Ax ≥ c} , (3.3)
where A is a constant integer matrix and c is a constant integer vector. This
corresponds to the intersection of a finite set of closed linear half-spaces (each
bounded by a hyperplane). Formula 3.3 is called the implicit representation
of a polyhedron. A bounded polyhedron is called a polytope. A k-dimensional
polyhedron/polytope is called a k-polyhedron/polytope.
Note that the dimension k of a polytope can be smaller than the
dimension n of the space it is defined in. For example, the triangle
with vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1) is a 2-dimensional polytope
embedded in a 3-dimensional space.
3.3 Vectors, polyhedra and polytopes 33
Some sources define a rational polyhedron as a subset of Rn instead
of Qn. The term rational only points to the fact that A and c are ratio-
nal (integer after multiplication with the least common multiple of the
denominators of the elements of both).
Recall that the line segment between the points A(x1, y1) and
B(x2, y2) can be written as
AB =
{
(x, y) ∈ Q2 |
[
x
y
]
= ν
[
x1
y1
]
+ (1− ν)
[
x2
y2
]
, ν ∈ [0, 1]
}
(3.4)
or
AB =
{
(x, y) ∈ Q2 |
[
x
y
]
=
[
x1 x2
y1 y2
] [
ν1
ν2
]
, νi ≥ 0,
∑
i
νi = 1
}
,
(3.5)
which is more intuitive than, but equivalent to using a system of in-
equalities as in (3.3) :

y2 − y1 x1 − x2
y1 − y2 x2 − x1
1 0
−1 0
0 1
0 −1


[
x
y
]
+


x2y1 − x1y2
x1y2 − x2y1
−min(x1, x2)
max(x1, x2)
−min(y1, y2)
max(y1, y2)


≥ 0 .
Polyhedra can be defined in a way similar to (3.5).
Definition 5. A polyhedron can also be represented using itsMinkowski (or
geometric) representation:
P =
{
x ∈ Qn | x = Lλ+Rµ+ V ν, µ ≥ 0, ν ≥ 0,
∑
i
νi = 1
}
. (3.6)
The columns of matrices L, R and V contain the lines, rays and vertices of
P , respectively.
A polytope is represented by the term V ν only, since the terms Lλ
and Rµ are not bounded. The term Lλ with unconstrained λ can be
omitted by replacing Lλ+Rµ by R′µ′ with R′ = [L − L R] and µ′ ≥ 0,
as demonstrated below. One can write
Lλ = Lζ − Lξ ,
34 Loop transformations in the polyhedral model
A
B
0 1 2 3 4
x
0
1
2
3
4
y
Figure 3.5: Graphical representation of the polyhedron in example 1.
with
ζi =
{
0 if λi < 0
λi if λi ≥ 0
, ξi =
{
−λi if λi < 0
0 if λi ≥ 0
,
and thus
Lλ+Rµ = [L − L R]

 ζξ
µ

 = R′µ′ ,
with ζ ≥ 0, ξ ≥ 0 and thus µ′ ≥ 0.
Example 1. Figure 3.5 shows the polyhedron P defined as
P =

(x, y) ∈ Q2 |

 1 0−1 0
1 1

[ x
y
]
≥

 1−3
3



 .
This can be written in a more compact way as
P =
{
(x, y) ∈ Q2 | 1 ≤ x ≤ 3, x+ y ≥ 3} .
The Minkowski representation is given by
P =

 (x, y) ∈ Q
2 |
[
x
y
]
=
[
0
1
]
µ+
[
1 3
2 0
] [
ν0
ν1
]
,
µ ≥ 0, νi ≥ 0,
∑
i νi = 1

 ,
which indicates the presence of one ray (0,1) and 2 vertices A (1,2) and B (3,0).
Definition 6. F is a face of a polyhedron P if and only if F is a non-empty
subset of P and
F = {x ∈ P | A′x = b′} , (3.7)
3.3 Vectors, polyhedra and polytopes 35
x
y
z
A B
CD
E
F
G
x
y
p
(a) (b)
Figure 3.6: The 3-polytope of example 2 (a) corresponds to the representation
of the parameterized 2-polytope of example 3 in the combined data/parame-
ter space (b). Intersections are shown for p equal to 3, 5 and 8.
for some subsystem A′x ≥ b′ ofAx ≥ b. Each face is a polyhedron itself, and is
called a k-face if it is a k-polyhedron. The 0-dimensional faces are the vertices
of the polytope.
Example 2. The 3-polytope in Figure 3.6(a) is defined as
P = {(x, y, z) ∈ Q3 | 0 ≤ x ≤ 5, 0 ≤ y ≤ 6, 0 ≤ z ≤ 11− x− y} ,
or
P =
{
V ν, νi ≥ 0,
∑
i
νi = 1
}
, V =

 0 5 5 0 0 0 50 0 6 6 6 0 0
0 0 0 0 5 11 6

 .
Its vertices (0-faces, columns of V ) are A (0,0,0), B (5,0,0), C (5,6,0), D (0,6,0),
E (0,6,5), F(0,0,11) and G (5,0,6). The other faces are the line segments (1-
faces) AB, BC, CD, DA, DE, EC, EF, FG, FA, GC, GB, the triangles and
quadrangles (2-faces) ABCD, CDE, ADEF, ABGF, CBG, CEFG, the polytope
P itself (3-face) and the empty set.
The 2-faces are found by adding one equality to the system of inequalities,
e.g., adding x = 0 leads to the face ADEF, adding x = 5 to CBG. These are the
intersections of the polyhedron with planes. The 1-faces are found by adding
two equalities, e.g., x = 0 ∧ y = 0 leads to AF, x = 5 ∧ y = 0 to BG. To find
the 0-faces, three equalities are needed, e.g., x = 0 ∧ y = 0 ∧ z = 11− x− y,
which determines F. The empty set is also a face, e.g., by using x = 0∧ x = 5.
36 Loop transformations in the polyhedral model
0 1 2 3 4 5 6
x
0
1
2
3
4
5
6
y
0 1 2 3 4 5 6
x
0
1
2
3
4
5
6
y
0 1 2 3 4 5 6
x
0
1
2
3
4
5
6
y
(a) p = 3 (b) p = 5 (c) p = 8
Figure 3.7: Parameterized polytope of example 3 for three parameter values.
3.3.3 Parameterized polyhedra and polytopes
Definition 7. A rational parameterized polyhedron Pp is a family of
polyhedra, defined as
Pp = P (p) = {x ∈ Qn | Ax ≥ Bp+ c} for p ∈ Qm , (3.8)
where A and B are constant integer matrices, c is a constant integer vector,
and p is a vector of parameters.
Example 3. The parameterized polytope
Pp =
{
(x, y) ∈ Q2 | 0 ≤ x ≤ 5, 0 ≤ y ≤ 6, 0 ≤ p, x+ y ≤ 11− p} ,
is shown in Figure 3.7 for 3 values of p. Figure 3.6(b) represents these poly-
topes as intersections of a polytope in the combined data/parameter space (cor-
responding to the polytope of example 2 with z = p) and planes with constant
parameter values p.
As indicated by example 3, each parameterized polyhedron can
be rewritten as a non-parameterized polyhedron in the combined
data/parameter space:
P ′ =
{[
x
p
]
∈ Qn+m | A′
[
x
p
]
≥ c
}
, (3.9)
where A′ = [A −B].
3.3 Vectors, polyhedra and polytopes 37
Table 3.2: Parametric vertices of Pp (example 3) with corresponding 1-faces
(line segments) of the polyhedron P ′ in the combined data/parameter space
(example 2).
Line segment (1-face) P ′ Vertex Pp Domain
DE (0, 6) 0 ≤ p ≤ 5
CE (5− p, 6) 0 ≤ p ≤ 5
CG (5, 6− p) 0 ≤ p ≤ 6
BG (5, 0) 0 ≤ p ≤ 6
EF (0, 11− p) 5 ≤ p ≤ 11
GF (11− p, 0) 6 ≤ p ≤ 11
AF (0, 0) 0 ≤ p ≤ 11
Let S(pˆ) be the subspace of Qn+m with constant parameter value
p = pˆ and projd the projection on the data space, i.e.
S(pˆ) =
{[
x
p
]
∈ Qn+m | p = pˆ
}
, (3.10)
projd : Q
n+m → Qn,
[
x
p
]
7→ x . (3.11)
Then,
Pp = projd(P
′ ∩ S(p)) . (3.12)
Theorem 1. The (parametric) vertices of a parameterized polyhedron Pp cor-
respond to projections on the data space of intersections of S(p) with the m-
faces of P ′, its representation in the combined data/parameter space (Equa-
tion 3.9), where m is the number of parameters [119].
Table 3.2 shows how the vertices of the 2-polytope of example 3 cor-
respond to the projections of intersections with line segments (1-faces)
of the 3-polytope in example 2. Each vertex does only exist within a cer-
tain parameter domain (the projection of the m-face on the parameter
space). The parametric vertices can be represented as affine functions
of the parameter(s). Some 1-faces, e.g., AB, do not have a correspond-
ing parametric vertex, as the projection of its intersection with S(p) is
not 0-dimensional.
From here on, the parameter values are restricted to integers:
p ∈ Zm. The set of integer points in a rational polyhedron is some-
times called an integer polyhedron [40]. However, more often this
38 Loop transformations in the polyhedral model
name (and also lattice polyhedron [24]) is reserved for a rational polyhe-
dron of which all the vertices have integer coordinates. Therefore, the
name Z-polyhedron/polytope is used to unambiguously denote the
intersection of a rational polyhedron/polytope with an integer lattice
(here always Zn), which corresponds to the set of integer points within
a rational polyhedron/polytope.
3.4 Program representation in the
polyhedral model
3.4.1 Statement and iteration domain
The concept of statement is known in all procedural programming lan-
guages and will not be defined here. It should be noted that a statement
is defined recursively, e.g., a compound statement is a statement com-
posed of a sequence of statements and a selection statement contains
statements preceded by a condition. Even a loop can be considered as
one statement.
Definition 8. The (lexical) depth of a loop or a statement is the number of
loops that surround it. A statement S is executed for a set of values of its iter-
ation vector, IS, the vector grouping the iterators of the surrounding loops.
This vector has a dimension equal to the depth, DS, of the statement. A single
execution of a statement is called a statement invocation. The iteration do-
main DS is the set of values of the iteration vector for which the statement S
is executed.
In this work only parts of a program with static control, Static Con-
trol Part or SCoP [57], are considered. These consist of (non-perfectly)
nested loops with constant strides and loop bounds that are linear
(affine) expressions of the iterators of the surrounding loops and some
(global) parameters. Conditional execution should be statically pre-
dictable and depend on Boolean expressions of affine inequalities. This
means that the iteration domains are not data dependent, but can be
determined at compile time as a function of the program parameters.
An iteration domain will be a parameterized Z-polytope or a union of
parameterized Z-polytopes.
Example 4. The code in Figure 1.7 contains three statements S1, S2 and S3.
However, the inner of the first loop nest can also be considered as a single com-
3.4 Program representation in the polyhedral model 39
#define S12(i,j) \
if (i==0) \
B[j]=1; \
B[i+j]=B[i+j]*A[i][j];
#define S3(k) \
C[k]=B[k]+B[k+1];
for (i=0;i<=N;i++){
for (j=0;j<=N-i;j++){
S12(i,j);
}
}
for (k=0;k<=N-1;k++){
S3(k);
}
Figure 3.8: Separation of statement definitions and loop control. The code is
equivalent to the code in Figure 1.7, but the choice to define two instead of
three statements was made.
pound statement, S12, as shown in Figure 3.8. In both figures the statements
are printed as functions of their iteration vectors, (i, j) and (k), respectively.
The iteration domains are given by:
DS1 = {(i, j) ∈ Z2|i = 0 ∧ 0 ≤ j ≤ N} (3.13)
DS2 = DS12 = {(i, j) ∈ Z2|0 ≤ i ≤ N ∧ 0 ≤ j ≤ N − i} (3.14)
DS3 = {(k) ∈ Z1|0 ≤ k ≤ N − 1}. (3.15)
3.4.2 Schedule vector
Definition 9. A program’s top-level (depth 0) is a sequence of loops and state-
ments which can be numbered from 0 onwards (Figure 3.9). Each loop in turn
contains a sequence of statements and loops that are also numbered from 0
onwards. Each statement is uniquely identified by the sequence of numbers
indicating the position in each of the surrounding loops and the top level.
This sequence is called the ordering vector.3 It has dimension DS + 1 with
DS the depth of statement S. The ordering vector can easily be constructed
from the Abstract Syntax Tree (AST) (Figure 3.10).
3Note that in this context ordering vector is completely different from the concept
mentioned in [159, Section 3.5].
40 Loop transformations in the polyhedral model
Depth Ordering Iteration Schedule
0 1 2 vector vector vector
| | |
0: for (i=0;i<=N;i++){
0: for (j=0;j<=N-i;j++){
0: if (i==0) S1(i,j); (0, 0, 0) (i, j) (0, i, 0, j, 0)
1: S2(i,j); (0, 0, 1) (i, j) (0, i, 0, j, 1)
}
}
1: for (k=0;k<=N-1;k++){
0: S3(k); (1, 0) (k) (1, k, 0)
}
Figure 3.9: Program example with indication of statement depth, ordering
vector and schedule vector.
0
0
0 1
0
1
for i for k
S3: C[k]=B[k]+B[k+1]for j
S2: B[i+j]=B[i+j]*A[i][j]if (i==0)
S1: B[i+j]=1
Program
Figure 3.10: Simplified abstract syntax tree (AST) of the code in Figure 3.9. The
ordering vector of a statement is formed by the labels of the branches taken
from the top node to the statement node. The schedule vector is constructed
in a similar way by also including the iterators of the traversed loop nodes.
The execution order of statement invocations depends on the posi-
tion in the program code, e.g., S1(0, 0) in Figure 3.9 is executed before
S2(0, 0), indicated by the ordering vectors, and on the value of the itera-
tors, e.g., S2(0, 0) is executed before S2(0, 1), indicated by the iteration
vectors. These two aspects are combined in the schedule vector.
Definition 10. The schedule vector of a statement is the vector with the ele-
3.4 Program representation in the polyhedral model 41
ments of the ordering vector as odd elements and the iterators as even elements
(Figure 3.9 and 3.10). The dimension is 2DS+1. The execution order of state-
ment invocations follows the lexicographical order of their schedule vectors.
The uniqueness of the ordering vectors ensures that the statement invocations
are strictly ordered.
This schedule vector is an instance of the more general concept
of schedule or schedule function [84, 85] (also called scattering func-
tion [27]). The schedule function of a statement Sx is a function of
its iteration vector, θSx(ISx), which defines the execution order of the
statement invocations by mapping each iteration on a time-stamp, i.e.
a logical execution time. A statement invocation Sx(i1, j1, . . .) precedes
another statement invocation Sy(i2, j2, . . .) if and only if θSx(i1, j1, . . .) ≺
θSy(i2, j2, . . .). For the example given above the schedule function is ex-
pressed as the schedule vector or
θS1(i, j) = (0, i, 0, j, 0)
θS2(i, j) = (0, i, 0, j, 1)
θS3(k) = (1, k, 0) .
By altering the schedule function (vector) the execution order of the
statement invocations will be changed. This is a way to describe loop
transformations in the polyhedral model. The schedule function may
be one-dimensional or multidimensional, as the schedule vector pre-
sented above. This is referred to as scheduling in one-dimensional or
multidimensional time, respectively [84, 85]. In this chapter the choice
of the schedule vectorwill always result in a total order of the statement
invocations. A partial order can also be used, e.g., to describe potential
parallelism. In chapter 4 one-dimensional schedules will be used that
map different invocations on the same time-stamp.
3.4.3 Dependences
If a statement invocation reads a data value that is written by another
statement, there is a dependence between those invocations and the ex-
ecutions should stay in the same order, to ensure correct execution re-
sults. Two invocations without dependences may change places. Three
types of dependences can be distinguished:
Read after Write (RaW) A data element is first written by one state-
ment invocation and then read by another statement invocation.
Also called producer-consumer or data dependence.
42 Loop transformations in the polyhedral model
for (i=l-1;i<=N-1;i++){
for (k=0;k<=l-1;k++){
if (k==0)
s=0;
s=s+a[i-k]*f[k];
if (k==l-1)
b[i]=s;
}
}
k
i
l − 1
l
N − 1
10 l − 1
RaW
WaR
(a) (b)
for (i=l-1;i<=N-1;i++){
for (k=0;k<=l-1;k++){
if (k==0)
s[i][0]=0;
s[i][k+1]=s[i][k]+a[i-k]*f[k];
if (k==l-1)
b[i]=s[i][k+1];
}
}
(c)
Figure 3.11: Program code (a) of example 5 for calculating a convolution with
geometric representation of true and false dependences as arrows between
points in the iteration domain (b). After conversion of the code to single as-
signment (c) the false dependences are removed.
Write after Read (WaR) A data element is read by one statement invo-
cation before its value is overwritten by another statement invo-
cation. Also called consumer-producer or anti-data dependence.
Write after Write (WaW) A data element is written by a first statement
invocation and then overwritten by another statement invocation.
Also called producer-producer or output dependence.
The first (RaW) is called a true dependence as there is a value passed from
the first to the second statement invocation. The other two cause a false
dependence since a same data location is chosen to contain two different
data values. If a different data location is chosen to contain the different
data values these dependences are removed.
A program in which each data location is written on at most once
is said to have the single assignment property [83]. Figure 3.22 gives an
example of single assignment code. Several techniques exist to rewrite
3.4 Program representation in the polyhedral model 43
a program into a single assignment form.
Example 5. Figure 3.11(a) shows a program fragment that calculates the con-
volution b of a signal a and a filter with coefficients fk according to Equa-
tion A.1. The “+=” operator is avoided for not hiding the read from the scalar
s. All dependences are caused by accesses to this scalar. The assignments
“s=0;” and “b[i]=s;” could be put in front of and behind the inner loop
but are put within the inner loop to have all the dependences in a single 2-D
space as shown in Figure 3.11(b). The WaR dependences can be removed by
expansion of the scalar s into a 2-D array as shown in Figure 3.11(c). As a
result, the different iterations of the outer loop can be executed in any order or
in parallel.
If the array indices are linear expressions of the iterators and param-
eters, the dependences between the statement invocations can also be
represented as the integer points in polyhedra or unions of polyhedra,
which are called dependence domains [31].
Definition 11. Consider the statements Sy and Sx. Their dependence do-
main, DSxδSy, using the simplified notation of [31], is defined as
Sy(i1) depends on Sx(i0)
m (3.16)
(i0, i1) ∈ DSxδSy ,
with Sy(i1) and Sx(i0) arbitrary statement invocations.
For the example of Figure 3.9 the non-empty dependence domains
are:
DS1δS2 = {(i0, j0, i1, j1) ∈ DS1 ×DS2|j1 = j0 ∧ i1 = i0}
DS2δS2 = {(i0, j0, i1, j1) ∈ D2S2|i0 + j0 = i1 + j1 ∧ i1 = i0 + 1}
DS2δS3 = {(i, j, k) ∈ DS2 ×DS3|k ≤ i+ j ≤ k + 1 ∧ j = 0} ,
from which the dependence domains for Figure 3.8 can simply be de-
rived since DS12δS12 =DS2δS2 and DS12δS3 =DS2δS3. Figure 3.14(a) shows
these dependences with arrows.
In fact, these dependences are still too restrictive. Using the asso-
ciativity and commutativity of the addition over R, the operations that
work on a certain element of array B (S2) can be executed in any order,
as long as they are all finished before the read of that element by an
44 Loop transformations in the polyhedral model
j
i
0
1
N
10 N
S1(0, N)
S2(i, j)
S3(N − 1)
j
i
0
1
N
10 N
S1(0, N)
S2(i, j)
S3(N − 1)
(a) (b)
Figure 3.12: Dependences of the statement invocations accessing B[N ] with-
out (a) and with (b) taking commutativity of the addition into account.
invocation of statement S3, and the element of B is first initialized by
the corresponding invocation of S1. With this in mind the dependence
domains would look like:
DS1δS2 = {(i0, j0, i1, j1) ∈ DS1 ×DS2|j0 = i1 + j1}
DS2δS2 = φ
DS2δS3 = {(i, j, k) ∈ DS2 ×DS3|k ≤ i+ j ≤ k + 1} .
This is illustrated in Figure 3.12 for a selection of the statement invo-
cations. The fact that more arrows are shown although there are less
dependencesmight be misleading. One should note that only the direct
dependences are shown and included in the dependence domains. A lot
of indirect dependences exist due to the transitivity of the dependences (if
B depends on A and C depends on B then C depends on A). The total
set of direct and indirect dependences is thus the transitive closure of
the dependence domains. Since when a transformation does not violate
direct dependences then also the indirect dependences are satisfied, it
suffices to study the direct dependences.
Writing statement S2 in a single assignment form would destroy
this extra freedom. Amethod to analyze commutativity using symbolic
analysis can be found in [141]. Also fractal symbolic analysis [129] is
under some conditions more powerful than dependence analysis. Typ-
ically, analysis techniques only give sufficient and not necessary con-
3.5 Loop transformations 45
~ex
~ey
(x, y)
~ey′
~ex′
(x′, y′)
~ex
~ey
(x′, y′)
(a) (b) (c)
[
x′
y′
]
=
[
0 1
−1 0
] [
x
y
]
(d)
Figure 3.13: The change of co-ordinate system (~ex′ = ~ey, ~ey′ = −~ex)(b) or the
linear transformation (rotation)(c) are both described by the same mathemati-
cal formula (d).
ditions for the semantic equivalence of two schedules of the statement
invocations.
In the rest of this work dependences will be based on memory ac-
cesses, without considering associativity and commutativity, unless
stated otherwise.
3.5 Loop transformations
The purpose of loop transformations is already discussed in Section 3.1:
reordering the execution of statement invocations to improve locality,
burst usage and parallelism. This section will show how loop transfor-
mations can be described using the polyhedral model and will demon-
strate on an example how the locality can be improved by applying a
sequence of loop transformations.
3.5.1 Principle and example
The starting point is a program from which the iteration domains of
the statements and their initial ordering and schedule vector can be de-
rived. Program transformations in the polyhedral model can be spec-
46 Loop transformations in the polyhedral model
ified by well chosen transformations of the schedule functions [27].
Theymodify the source polyhedra into target polyhedra containing the
same statement invocations but in a new coordinate system, thus with
a new lexicographical order. The new iterators are used to scan the
image of the domains under applying the schedule function.
With this definition the iteration domains are kept untouched but
only expressedwithin another co-ordinate system. In the informal defi-
nition of loop transformations sometimes terminologywill be used that
expresses a transformation of the iteration domain, while the actual im-
plementation only changes the schedule function. This is analogous to
the resemblance of linear transformations with a fixed co-ordinate sys-
tem and transformations of the co-ordinate system itself. They are both
describedwith the same formulas (Figure 3.13). However, the represen-
tation of some transformations in the polyhedral model requires trans-
forming both the iteration domains and the schedule vectors, e.g., when
new iterators augment the dimension of a statement domain (e.g., strip-
mine) or new statements are created (e.g., peeling) [90, 91].
Below, an example sequence of transformations will be described
that leads from the code in Figure 3.1 to the code in Figure 3.2. An
overview of the schedule vectors and statement invocation arguments
during the transformation sequence is found in Table 3.3. The cor-
responding polyhedral representations are found in Figure 3.14. The
original domains are shown in Figure 3.14(a). The iterations of S2 are
represented by circles and those of S3 by stars. The arrows indicate the
dependences through the reuse of elements of array B. The schedule
vectors of the statements are 5- and 3-dimensional, respectively, and
are projected to a common 2-dimensional space. The execution order is
from left to right and from top to bottom (scan-line order).
The reuses in the first loop nest can be put along one axis by a loop
transformation called skewing (Table 3.4). This transformation alters
the schedule of statement S2:
θS2(i, j) = (0, i, 0, i + j, 0) .
With the introduction of new iterators i′ = i and j′ = i + j this is ex-
pressed as the interleaving of the old ordering vector (0, 0, 0) and a new
iteration vector (i′, j′)with correspondingmapping of the statement in-
vocations between the two iteration spaces:
θS2(i
′, j′) = (0, i′, 0, j′, 0)
S2(i, j) 7→ S2(i′, j′ − i′) .
3
.5
L
o
op
tra
n
sfo
rm
atio
n
s
47
Table 3.3: Schedule vector during the example sequence of transformations (Figure 3.14) expressed as a function of the
original iterators and as a function of new iterators that scan the image of the domains under the schedule function, i.e. the
time domain, in lexicographical order. The statement invocations are mapped to corresponding invocations in function of the
new iterators.
(a) (b) (c) (d) (e)
Original After skewing After interchange After fusion After shift
θS2(orig it) (0, i, 0, j, 0) (0, i, 0, i + j, 0) (0, i+ j, 0, i, 0) (0, i + j, 0, i, 0) (0, i + j, 0, i, 0)
θS2(new it) (0, i, 0, j, 0) (0, i′, 0, j′, 0) (0, i′′, 0, j′′, 0) (0, i′′, 0, j′′, 0) (0, i′′, 0, j′′, 0)
S2(i, j) 7→ S2(i, j) S2(i′, j′ − i′) S2(j′′, i′′ − j′′) S2(j′′, i′′ − j′′) S2(j′′, i′′ − j′′)
θS3(orig it) (1, k, 0) (1, k, 0) (1, k, 0) (0, k, 1) (0, k + 1, 1)
θS3(new it) (1, k, 0) (1, k, 0) (1, k, 0) (0, k, 1) (0, k′, 1)
S3(k) 7→ S3(k) S3(k) S3(k) S3(k) S3(k′ − 1)
48 Loop transformations in the polyhedral model
j
i
k
0
1
N
10 N
0
1
N
S2(i, j)
S3(k)
j′
i′
k
0
1
N
10 N
0
1
N
j′′
i′′
k
0
1
N
10 N
0
1
N
(a) (b) (c)
j′′
i′′, k
0
1
N
10 N
j′′
i′′, k′
0
1
N
10 N
(d) (e)
Figure 3.14: Polyhedral representation of the statements in Figure 3.1 after
each step in a sequence of transformations. The 5- and 3-dimensional schedule
vectors are projected to a common 2-D space. The execution order is from left
to right and from top to bottom (scan-line order). Original code (a) and after
skewing (b), interchange (c), fusion (d) and shift (e). The (dashed) arrows
indicate (violated) dependences.
The iteration domain in the new (i′, j′) space is shown in Figure 3.14(b).
Now, the reuses occur between different iterations of the outer loop.
By interchanging the two loops, the reuses occur between different it-
erations of the inner loop and inside one iteration of the outer loop
(Figure 3.14(c)). The locality of the reuses of B in the first loop nest is
now optimized. This skewing and interchange corresponds to the tile i
transformation proposed in Figure 3.4.
3.5 Loop transformations 49
The other major suggestion in Figure 3.4 is to fuse the loops i and
k (Figure 3.14(d)), done by adjusting the ordering vector of S3. This
transformation results in an illegal execution order. Several elements
are read before they receive the correct value as indicated by the dashed
arrows. This can be solved by shifting the iterations of statement S3
over one iteration (Figure 3.14(e)).
The final result corresponds to the code in Figure 3.2 (disregarding
the names of the iterators) and the histogram in Figure 3.3(b), which
proves the improvement in data locality.
Note that the new iterators do not have to be introduced after each
transformation step. It is possible to perform all transformations on
the schedule function using the original iterators. Only during the code
generation phase after the last transformation step, new iterators, which
scan the images (in the time domain) of the iteration domains under
the schedule functions, are introduced.
3.5.2 Available infrastructure
For the experiments described in this work, usewasmade of theWRaP-
IT, URUK tool suite [57, 29, 91] and the CLooG code generator [27, 28],
developed at several French laboratories (INRIA Futurs, LRI (Paris-
Sud), PRiSM (Versailles), . . . ). These tools are available under the GNU
General Public License (GPL).
Loop transformations
In the polyhedralmodel loop transformations are described as transfor-
mations on the matrices representing the iteration domains and sched-
ule functions. A way is needed to transform a program into a polyhe-
dral representation and apply the transformations, preferably specified
at a higher level than matrix operations.
The WRaP-IT infrastructure (WHIRL Represented as Polyhedra
- Interface Tool) translates a program from the WHIRL intermedi-
ate representation (Winning Hierarchical Intermediate Representation
Language) of the Open64/ORC (Open Research Compiler) [5] into a
polyhedral representation, WHIRL Represented as Polyhedra (WRaP),
and back again (after transformations).
The URUK tool (Unified Representation Universal Kernel) applies
loop transformations using the WRaP intermediate format. The loop
50 Loop transformations in the polyhedral model
Table 3.4: List of common loop transformations [165]. The first 5 transforma-
tions are elementary. Tiling can be constructed ad-hoc from elementary trans-
formations. More detailed descriptions with examples are found in [165, 90]
Name Effect
Loop Fusion Combining two adjacent loops into one loop
Loop Peeling Loop peeling removes (peels off) the first (or last)
iteration of a loop and places it into separate code.
This can be generalized into peeling off several it-
erations of the loop.
Loop Interchange Interchanging two nested loops switches the inner
and outer loop.
Strip-mine Strip-mining decomposes a single loop into two
nested loops. The resulting inner loop (the element
loop) executes a fixed number of consecutive iter-
ations of the original loop, called a strip. The outer
loop (the strip loop) iterates over the strips.
Skewing Skewing changes the iteration vectors for each iter-
ation by adding (a multiple of) one loop index value
to another loop index.
Tiling Tiling rewrites the program as an iteration over
blocks (tiles), each executing a fraction of the state-
ment invocations. This is an extension of strip-
mining to multiple nested or non-nested loops. It
is usually not achieved by a single transformation
but needs several transformation steps, e.g., strip-
mine and interchange in a perfect loop nest.
skew(enclose(S2),1,2,1)
interchange(enclose(S2))
fusion(enclose(S2,2))
shift(S3,{[0,1]})
Figure 3.15: URUK script of the transformation sequence shown in Figure 3.14
transformation sequence can be specified in a script with each line
containing one elementary transformation step. Figure 3.15 shows
the script corresponding to the transformation sequence described in
Section 3.5.1. Next to the elementary transformations delivered with
3.5 Loop transformations 51
URUK, new user-defined transformations can be used after specifying
them in a “.def” file. This file type allows to declare new elementary
transformations as a combination of other elementary transformations
and operations on the matrices representing the domains and schedule
functions.
A detailed overview of the implementation of the transformation
process and the definition of many common loop transformation in the
polyhedral model is given by [90, 91]. The general form of a schedule
function used is
θS (IS) = ΘS

 ISIgp
1

 , (3.17)
with IS the vector of iterators, Igp the vector of global parameters and
ΘS a matrix of the form
4
ΘS =


0 . . . 0 β0
Φ1,1 . . . Φ1,d+dgp Φ1,d+dgp+1
0 . . . 0 β1
Φ2,1 . . . Φ2,d+dgp Φ2,d+dgp+1
...
. . .
...
...
Φd,1 . . . Φd,d+dgp Φd,d+dgp+1
0 . . . 0 βd


. (3.18)
The even lines of ΘS (counting from 0) represent the sequential order-
ing of statements as described by the ordering vector. This means these
lines contain 0 in the first d + dgp columns (d = DS = number of itera-
tors, dgp =number of parameters) and an element of the ordering vec-
tor (called βS). The odd lines express the order of iterations. The first
d columns contain the coefficients of the iterators and the next dgp + 1
the coefficients of the parameters and constant. The matrix Φ, used in
(3.18), is therefore split into the matrices A and Γ:
Φ =


A1,1 . . . A1,d Γ1,1 . . . Γ1,dgp+1
A2,1 . . . A2,d Γ2,1 . . . Γ2,dgp+1
...
. . .
...
...
. . .
...
Ad,1 . . . Ad,d Γd,1 . . . Γd,dgp+1

 . (3.19)
52 Loop transformations in the polyhedral model
SW SW
Loop Xform
loop.cloop.cloog
defines.h
control
statements
CLooG
Figure 3.16: The Chunky Loop Generator (CLooG) generates (control) code
from a polyhedral representation (*.cloog). Together with the statement defi-
nitions (*.h) this results in executable software.
Code generation
After all transformations have been applied, the program is still rep-
resented in the polyhedral model, i.e. a set of statements each with a
domain and schedule function. Code has to be generated that iterates
over the statement invocations in such a way that the lexicographical
order defined by the schedule function is respected. The Chunky Loop
Generator (CLooG) [28, 27, 26] performs this task. The basic concept
of code generation consists of projecting the images of the iteration do-
mains under the schedule functions on the several axes, corresponding
to the (new) iteration dimensions.
A version of CLooG, called UrGenT (Uruk GeneraTor), is used by
WRaP-IT to generate WHIRL after loop transformations. CLooG also
exists as a stand-alone tool and library. The generated code (similar to
the lower part of Figure 3.8) has to be extended with statement defini-
tions (similar to the upper part of Figure 3.8) to create executable code
(Figure 3.16).
A trade-off between code size and run-time control complexity can
be made as illustrated by the following example:
Example 6. Consider two statements with the same square iteration domain
but with an offset (2, 2) forced by the schedule function (Figure 3.17).
DS1 = DS2 = {(i, j) ∈ Z2 | 0 ≤ i ≤M, 0 ≤ j ≤M}
θS1 = (0, i, 0, j, 0)
θS2 = (0, i + 2, 0, j + 2, 1)
Code generated by CLooG v. 0.12.2 is shown in Figure 3.18 for two settings
of the “-f” control optimization option. Without control optimization (“-f
4In [90, 91] the statement labels are written as superscript instead of subscript.
3.5 Loop transformations 53
p3
p1
0
2
M
M + 2
0 2 M M + 2
Figure 3.17: Two square domains of the statements in example 6. The dotted
lines indicate the convex hull of the union of the domains.
-1”) (Figure 3.18(a)) the iterators scan the convex hull (dotted lines in Fig-
ure 3.17) of the union of the two domains and the statements are guarded by
conditions that check if the iterator values are in the iteration domain of the
corresponding statement. Note that for two iterations, (p1, p3) = (M + 1, 1)
and (p1, p3) = (1,M + 1), none of the statements are invoked, which leads to
useless execution time. Also the run-time evaluation of the conditional guards
causes an overhead. With control optimization (“-f 1”) the code is much
longer (Figure 3.18(b)) but no time is wasted at run-time with useless itera-
tions and condition checking.
The situation of domains that are roughly the same and only have
non-overlapping parts at the borders is very common after loop trans-
formations, e.g., caused by shifts that fix violated dependences. Also
many image processing algorithms have irregular behavior at the bor-
ders compared to the body of the image.
Automation of loop transformations in this work
In this chapter all loop transformations and sequences of them are spec-
ified manually, directly in the polyhedral model or at a higher level us-
ing URUK scripts. The application of them is automated by the tools
mentioned above. Only in Chapter 4 transformations, or to be more
exact schedules, will be constructed in an automated way.
54 Loop transformations in the polyhedral model
for (p1=0;p1<=M+2;p1++) {
for (p3=max(p1-M,0);p3<=min(p1+M,M+2);p3++) {
if ((p1 <= M) && (p3 <= M)) {
S1(p1,p3) ;
}
if ((p1 >= 2) && (p3 >= 2)) {
S2(p1-2,p3-2) ;
}
}
}
(a)
for (p1=0;p1<=1;p1++) {
for (p3=0;p3<=M;p3++) {
S1(p1,p3) ;
}
}
for (p1=2;p1<=M;p1++) {
for (p3=0;p3<=1;p3++) {
S1(p1,p3) ;
}
for (p3=2;p3<=M;p3++) {
S1(p1,p3) ;
S2(p1-2,p3-2) ;
}
for (p3=M+1;p3<=M+2;p3++) {
S2(p1-2,p3-2) ;
}
}
for (p1=M+1;p1<=M+2;p1++) {
for (p3=2;p3<=M+2;p3++) {
S2(p1-2,p3-2) ;
}
}
(b)
Figure 3.18: Code generated for the statements of example 6, with control
optimization options “-f -1” (a) and “-f 1” (b).
3.6 Composing sequences of loop transformations
3.6.1 Introductory example
In the example transformation sequence of Section 3.5.1 doing the shift
before the loop fusion would have led to the same result. In general,
3.6 Composing sequences of loop transformations 55
S1
S2
j
10 N − 1 2N
j
10 2N
i
1
0
k
1
0
for j=0, 2N
S1(j);
for j=0, N-1
S2(j);
for i=0, floor(N/2)
for j=4i, min(2N,4i+3)
S1(j);
for k=0, floor((N-1)/2)
for j=2k, min(N-1,2k+1)
S2(j);
(a) (b)
j
10 2N
i
−1
1
0
j
10 2N
i
−1
1
0
for i=0, floor((N-1)/2)
for j=4i, min(2N,4i+3)
S1(j);
for j=2i, min(N-1,2i+1)
S2(j);
if (N % 2 == 0)
S1(2N);
for j=0,3
S1(j);
for i=0, floor((N-2)/2)
for j=4i+4, min(2N,4i+7)
S1(j);
for j=2i, 2i+1
S2(j);
if ((N-1) % 2 == 0)
S2(N-1);
(c) (d)
Figure 3.19: Geometrical representation of a sequence of transformations, re-
sulting in a tiled variant. N = 4. The corresponding code blocks (with sim-
plified for loop syntax as defined on page 56) are shown only for clarification
and are normally not produced until the final code generation. A dotted line
represents an execution point and the arrows that are cut by this line represent
the data elements that have to be stored at that moment. (a) original code, (b)
after strip-mine, (c) after fusion, (d) after shift.
however, changing the order of transformation steps may have an im-
pact as is demonstrated by the following example.
Consider the following piece of code:
56 Loop transformations in the polyhedral model
// Schedule
for (j=0;j<=2*N;j++) // vector
B[j]=A[2j]+A[2j+1]; //S1(j) (0,j,0)
for (j=0;j<=N-1;j++)
C[j]=B[2j]+B[2j+1]+B[2j+2]; //S2(j) (1,j,0)
A geometrical representation of the iteration domains of the two state-
ments is shown in Figure 3.19(a). The horizontal axis shows iterator j,
and the vertical is a projection of the ordering vector. The execution
order is in scan-line order: from left to right and from top to bottom.
The data dependences are indicated by arrows. It shows that for the
original code all operations within one loop are independent.
This variant of the code exhibits poor data locality. First, all ele-
ments of B are produced by the first loop. Then, all elements are con-
sumed by the second loop (dotted line in Figure 3.19(a)). The locality
can be improved by interleaving the operations of the two loops, e.g.,
by a loop transformation called tiling. This transformation reorders the
program as an iteration over blocks (or stripes in the 1-D case) in which
small parts of both loops are executed (Table 3.4). Figure 3.19 shows
the transformation steps leading to a tiled variant. The corresponding
code blocks are shown only for clarification and are normally not pro-
duced until the final code generation. For compactness the C syntax
of a loop statement “for(i=l;i<=u;i++)” is abbreviated to a FOR-
TRAN like syntax “for i=l, u”.5 The tile size is chosen such that
each tile executes 4 iterations of S1 and 2 iterations of S2.
Strip-mining (Figure 3.19(b)) partitions the iteration domains into
strips by adding extra iterators, i and k, respectively. From here, the
schedule vectors are 5-dimensional and again projected on a common
2-dimensional space with scan-line execution order. After fusion (Fig-
ure 3.19(c)), invocations of both statements are executed in each tile.
The dashed arrows indicate that some data dependences are violated.
Each tile consumes an element that is only produced in the next tile.
This can be solved by shifting S1 over one tile (Figure 3.19(d)). A dotted
line represents an execution point and has the executed points at one
side and the still to be executed points at the other side. The execution
of the program can be seen as a shift of this line, invoking statement
instances that pass from one side to the other. The arrows that are cut
by this line represent the data elements that are alive and have to be
stored at that moment.
5Real FORTRAN uses “DO” instead of “for” loops.
3.6 Composing sequences of loop transformations 57
S1
S2
j
10-1 2N
j
10-1 2N
for j=0, 2N
S1(j);
for j=0, N-1
S2(j);
for j=-1, 2N-1
S1(j+1);
for j=0, N-1
S2(j);
(a) (b)
j
10−1 2N
i
1
0
−1
k
1
0
j
−1 10 2N
i
−1
1
0
for i=-1, floor((2N-1)/4)
for j=max(4i,-1), min(2N-1,4i+3)
S1(j+1);
for k=0, floor((N-1)/2)
for j=2k, min(N-1,2k+1)
S2(j);
S1(0);
for i=0, floor((N-1)/2)
for j=4i, min(2N-1,4i+3)
S1(j+1);
for j=2i, min(N-1,2i+1)
S2(j);
(c) (d)
Figure 3.20: The sequence of Figure 3.19 can be improved by doing a shift
before the strip-mine and fusion. This results in a better data locality. N = 4.
(a) original code, (b) after shift, (c) after strip-mine, (d) after fusion.
After these transformations the locality has improved but is not op-
timal yet. Eight results produced by S1 (originally 2N + 1) have to be
stored before they are read by S2 (dotted line in Figure 3.19(d)) while
five would have been sufficient (dotted line in Figure 3.20(d)). Better re-
sults are obtained by doing a shift (Figure 3.20(b)) before the strip-mine
(Figure 3.20(c)) and fusion (Figure 3.20(d)).
3.6.2 Duplication of statement invocations
In the example in the previous section each tile consumes data gener-
ated by the previous tile and by the tile itself. As a result, the tiles can
only be executed in the order indicated by the iterator i.
58 Loop transformations in the polyhedral model
The tiles can be made independent by duplicate production of
shared data elements. This introduces a trade-off between parallelism
(number of independent tiles) and computation cost. If the tiles are
large the impact of the duplicated calculations is relatively small.
After skewing (Figure 3.21(a)), all tiles but the first and possibly also
the last (if N is odd), have the same shape. The data elements read by
2 different tiles are now produced by the invocations of S1 for which
j = 3. In the figures only 2 full tiles are shown (N = 4). For larger
N there will be more tiles (⌈N/2⌉) and shared elements (⌈N/2⌉-1). To
make the tiles independent the statement invocations producing the
shared elements are duplicated. A new statement is constructed: S1′,
indicated with a star, which is a copy of S1 with only the invocations
for which j = 3. This can be regarded as peeling off those invocations
without removing them from the loop, i.e., peeling with duplication
(Figure 3.21(b)). Each duplicate is shifted to the next tile and fused with
the first loopwithin that tile (Figure 3.21(c)). The tiles are now indepen-
dent and can run in parallel. Two useless statement invocations (1 if N
is odd), are removed by adding constraints to the systems representing
the iteration domains of S1 and S1′ (Figure 3.21(d)).
Finally, executable code can be generated from the polyhedral rep-
resentation:
for (i=0;i<=floor((N-1)/2);i++){
B[4i]=A[8i]+A[8i+1]; //S1’(4i);
for (j=0;j<=min(3,2N-4i-1);j++)
B[j+4i+1]=A[2j+8i+2]
+A[2j+8i+3]; //S1(j+4i+1);
for (j=0;j<=min(1,N-2i-1);j++)
C[j+2i]=B[2j+4i]+B[2j+4i+1]
+B[2j+4i+2]; //S2(j+2i);
}
The code generator considers S1 and S1′ as different statements. There-
fore, the code is larger than needed. Knowing that they are copies of
each other it is possible to remove the line with S1′ and let the iterator
j around S1 start from −1 instead of 0.
3.6.3 Commutativity of elementary loop transformations
The examples in Section 3.5.1 and 3.6.1 illustrate that sometimes the
order of transformation steps is important and sometimes it is not. In
3.6 Composing sequences of loop transformations 59
j
−1 10
i
−1
1
0
S1’
j
−1 10
i
−1
1
0
S1(0);
for i=0, floor((N-1)/2)
for j=0, min(3,2N-1-4i)
S1(j+4i+1);
for j=0, min(N-1-2i,1)
S2(j+2i);
S1(0);
S1’(0);
for i=0, floor((N-2)/2)
for j=0, 3
S1(j+4i+1);
S1’(4i+4);
for j=0, 1
S2(j+2i);
if ((N-1) % 2 == 0)
for j=0, 1
S1(j+2N-1);
S2(N-1);
(a) (b)
j
−1 10
i
−1
1
0
j
−1 10
i
1
0
S1(0);
for i=0, floor((N-1)/2)
S1’(4i);
for j=0, min(3,2N-1-4i)
S1(j+4i+1);
for j=0, min(1,N-1-2i)
S2(j+2i);
if (N % 2 == 0)
S1’(2N);
for i=0, floor((N-1)/2)
S1’(4i);
for j=0, min(3,2N-1-4i)
S1(j+4i+1);
for j=0, min(1,N-1-2i)
S2(j+2i);
(c) (d)
Figure 3.21: The tiles can be made independent by duplication of shared data
elements. N = 4. They now can run in parallel at the cost of doing some
calculations twice. (a) after skewing, (b) after peeling with duplication, (c)
after shift and fusion, (d) after adding constraints.
60 Loop transformations in the polyhedral model
this section this is discussed in more detail. Two elementary transfor-
mations that affect different statements or disjoint sets of statements
do always commute (in the polyhedral representation). Therefore, this
case is not considered below. We consider the permutation of trans-
formations that affect at least one statement in common and find four
cases:
1. Some transformation steps can be applied in any order without
altering the description of the transformation steps. For example,
two shifts, a shift and a fusion, two loop interchanges that work
on distinct couples of iterators.
2. Some transformations can be applied in a different order with an
appropriate adjustment of the operands. For example, a unimod-
ular transformation (multiplication of A and Γ in (3.19) with a
unimodular matrix U , e.g., loop interchange and skewing) pre-
ceded by a shift (addition of a vectorM to Γ) can be written as the
same modular transformation followed by a transformed shift.
A′ = UA
Γ′ = U(Γ +M) = UΓ + UM = UΓ +M ′
3. In some cases the expressiveness of one order of two kinds of
transformations is larger than the other. For example, one shift
before a peeling can be expressed as two (identical) shifts after
the peeling, working on the main loop and the peeled-off part, re-
spectively. In general, shifts after the peeling can not be replaced
by shifts before the peeling as the latter can not control the devia-
tion between the two different statements created by the peeling.
4. For a last class of cases, each order of two kinds of transforma-
tions can express transformations that can not be expressed by
the other order. As shown in the example in Section 3.6.1 a shift
before a strip-mine can control the grouping of statement invoca-
tions in the tiles, which is not possible with shifts after the strip-
mine. However, the strip-mine introduces a new variable and
only shifts after the strip-mine can independently affect the ele-
ment and strip loop iterator (Table 3.4).
3.7 Useful transformation sequences for the 2-D IDWT 61
3.6.4 A canonical order
One problem in automated search space exploration of loop transfor-
mation is that different transformation sequences may lead to the same
result. This is, a.o., caused by the fact that some loop transformations
commute (possibly with adjustments in the operands). Therefore, it is
worthwhile to wonder if some kind of canonical order of transforma-
tion steps could be constructed. If such an order would exist it would
be easier to avoid the generation of different sequences that have equiv-
alent results.
Under certain restrictions a canonical order exists. For example,
when the transformations are restricted to unimodular transformations
and shifts (cf. case 2 in Section 3.6.3). However, in the general case,
when the number of statements and iterators may change, the situation
becomes much more complex. Then, at least a partial canonical order can
be constructed. For example, if the order of all transformation steps
that are not shifts is given, the shifts can be moved over other steps in
the transformations sequence until only shifts remain at the beginning
of the sequence and after each transformation step that is of a kind for
which not having a shift after it would reduce the expressiveness (cf.
cases 3 and 4 in Section 3.6.3). This has been used to examine the dif-
ferences between the transformation sequences LB HV and LB VH and
between RC HV and RC VHmentioned at the end of Section 3.8.
3.7 Useful transformation sequences
for the 2-D IDWT
Figure 3.22 shows pseudo code of a Row-Column-wise 2-D IDWT, cor-
responding to the manual design described in Section A.3. This algo-
rithm is far from optimal with regard to several criteria, such as data
locality or burst mode usage. Therefore, it has to be modified in ac-
cordance to the desired properties. Variants with a better data locality
can be constructed by transforming the original variant. The purpose
is to reduce the external memory bandwidth, by retaining data that is
frequently used in on-chip buffers. To effectively exploit the created lo-
cality, obviously memory allocation andmapping of the data is needed.
We refer to [158] for a possible approach.
Many variants of the (I)DWT have been described in the literature.
An overview is found in Section A.5. Since none of the cited papers
62 Loop transformations in the polyhedral model
1: for l = k, 1
2: Rl = R / 2
l
3: Cl = C / 2
l
4: for j = 0, 2Cl − 1 // Vertical filtering.
5: for i = 0, Rl − 1
6: Bl,2i,j =
∑1
n=−1Al,i+n,j h3−2n +
∑1
n=−2 Al,Rl+i+n,j g3−2n
7: Bl,2i+1,j =
∑2
n=−1Al,i+n,j h4−2n +
∑2
n=−2 Al,Rl+i+n,j g4−2n
8: for i = 0, 2Rl − 1 // Horizontal filtering.
9: for j = 0, Cl − 1
10: Al−1,i,2j =
∑1
n=−1Bl,i,j+n h3−2n +
∑1
n=−2Bl,i,Cl+j+n g3−2n
11: Al−1,i,2j+1 =
∑2
n=−1Bl,i,j+n h4−2n +
∑2
n=−2Bl,i,Cl+j+n g4−2n
Al =
[
LLl HLl
LH l HHl
]
, Bl = [Ll Hl] for l ≥ 1 and A0 = LL0.
Figure 3.22: Simplified representation of the Row-Column-wise 2-D IDWT.
R and C are parameters representing the number of rows and columns of
the image that is transformed over k levels. For each level l, Al and Bl are
two-dimensional arrays used to store the different subbands. The vectors g
and h contain the filter coefficients, i.e., the impulse response, of the wavelet
synthesis filters G and H. They have lengths 9 and 7, respectively.
mentions the path to a certain implementation it is assumed that they
are all constructed by doing code transformations manually. This task
is error-prone, time consuming andmakes it hard to exploremany vari-
ants of an implementation. As mentioned in Section 3.5.2, tools exist to
automate the application of loop transformations. The user only has
to specify the sequence of transformations to perform. This sequence
can be found by manual analysis of the code, possibly aided by tools,
such as SLO (Section 3.2.2). The transformation tools can also verify the
correctness of the result of a transformation by verifying that no data
dependences are violated.
We would like to use the polyhedral model to improve the locality
of the IDWT. However, the code in Figure 3.22 contains an exponen-
tial function in the loop boundaries (Rl and Cl) and therefore it does
not fit into the model since the loop bound expressions are not linear.
This problem can be resolved in two ways. If the number of levels k is
known the outer loop can be unrolled. Else one can restrict the polyhe-
dral model to the body of the outer loop, where 2l becomes a parameter,
3.7 Useful transformation sequences for the 2-D IDWT 63
Table 3.5: Primary transformations expressed as a sequence of elementary
transformations. The first three generate IDWT variants starting from the orig-
inal row-column-based variant. The fourth transforms a tiled variant, created
after applying Tile V or Tile H (or both as in Section 3.8), into a variant with
overlapping tiles. The numbers indicate the order of the transformation steps.
Steps between brackets may become unnecessary when combining several of
these sequences.
St
re
tc
h
In
te
rc
ha
n
ge
Fu
si
o
n
Sh
ift
St
rip
-
m
in
e
Pe
e
lin
g
Pe
e
lin
g
w
ith
du
pl
ic
a
tio
n
Sk
ew
Ad
d
Co
n
st
ra
in
t
RC2LB 1 2 3 (4)
Tile V (2) 4 3 1
Tile H 4 6 2,5 3 1
Overlap 4 3 2 (1) 5
and put the result after transformation back inside the l-loop. Writing
the program in a single assignment form adds flexibility by removing
false dependences.
The poor locality of the Row-Column-Based IDWT can be improved
in several ways. The accesses of the horizontal and vertical filtering
can be brought closer together but also the accesses of the different
transformation levels (loop l). Below we will discuss three transforma-
tion sequences, each improving the locality of the original algorithm
in a different way. We will call them primary sequences. A fourth se-
quence (Overlap) starts from a tiled variant, to make the tiles overlap-
ping. Table 3.5 shows how the sequences are composed of elementary
loop transformations (cf. Table 3.4). A similar table can be constructed
for the DWT. Section 3.8 will show that the primary sequences can be
combined to form larger sequences.
3.7.1 From row-column-based to line-based (RC2LB)
Initially, the horizontal and vertical filtering scan the image in differ-
ent directions. Doing the vertical filtering line by line (interchange of
64 Loop transformations in the polyhedral model
LH1 HH1
HL1LL1
H1L1 LL0
Vert. xf Hor. xf
IDWT IDWT
Figure 3.23: Vertical filtering in a horizontal scan order leads to a line-based
IDWT
IDWT IDWT
(a) (b)
Figure 3.24: Vertical and horizontal tiling
the loops on line 4 and 5 of Figure 3.22, left side of Figure 3.23), allows
to interleave (or execute in parallel through pipelining) the horizontal
and vertical filtering operations (fusion of the loops with iterator i). As
a result less data has to be stored; only several lines instead of an en-
tire frame. Stretching the vertical filtering in the vertical direction, i.e.,
doubling the range of iterator i fromRl to 2Rl (line 5 in Figure 3.22) and
only execute for even values, is needed to adjust the rate of the vertical
transformation to the rate of the horizontal transformation, for which
iterator i goes from 0 to 2Rl − 1. A shift is needed where dependences
are violated. The resulting line-based IDWT (LB) is still done level by
level. Coarse-grain parallelism can be introduced by pipelining the dif-
ferent transformation levels as in [54].
3.7.2 Vertical tiling (Tile V)
Vertical tiling divides the iteration domains into horizontal stripes,
with heights proportional to the subband they are contained in (Fig-
ure 3.24(a)). Operations are not done level by level but for all levels
within a set of corresponding stripes, one in each subband. This is
reached by strip-mining all loops with iterator i by a factor propor-
tional to Rl, so that each subband is divided into the same number
of stripes. Loop interchanges are needed to make the resulting strip
3.7 Useful transformation sequences for the 2-D IDWT 65
iterators (Table 3.4), which iterate over the stripes within one subband,
the most significant, i.e., outside the unrolled l-loop. After fusion there
remains only one strip iterator that iterates over sets of data consisting
of one stripe in each subband (Figure 3.24(a)). Shifts restore violated
dependences. This sequence is similar to the tiling demonstrated in
Figure 3.20. We will call this variant a vertically tiled IDWT (Tile V).
Since this loop transformation sequence combines operations of differ-
ent wavelet transformation levels, it can only be applied in the polyhe-
dral model if the number of levels k is fixed and the l-loop is unrolled.
However, for any value of k a tiling transformation sequence can be
constructed. Manually, it is possible to construct a tiled implementation
where k is a parameter again.
3.7.3 Horizontal tiling (Tile H)
Horizontal tiling is analogous to vertical tiling except for an initial peel-
ing and shift. After naively strip-mining the loops with iterator j, the
first half of the stripes would do a vertical transform on elements of the
LLl and LHl subbands, generating elements of the Ll, and the other
half would generate the elements of Hl. Horizontal filtering in vertical
stripes (Figure 3.24(b)) is only possible if corresponding stripes in theLl
andHl band (Cl apart) are produced simultaneously. Else, the horizon-
tal transformation could only start after more than half of the stripes
are processed by the vertical filtering, i.e., when samples from both the
Ll and Hl band are available. Therefore, the left and right halves of
the vertical filtering domains (first and last Cl iterations of line 4 in Fig-
ure 3.22), producing the Ll andHl band, respectively, are split (peeling)
and interleaved (shift of the right half over −Cl).
3.7.4 Overlap
After Vertical or Horizontal tiling, some intermediate results at the
edges of a stripe are also used by neighboring stripes, potentially lead-
ing to large overlap buffers and inter-stripe dependences. To make
the different blocks or stripes independent of each other the statement
invocations calculating these data can be duplicated, generating over-
lapping stripes. This decreases the buffer size and increases parallelism
at the cost of extra bandwidth and calculations. This is already demon-
strated by the process leading to Figure 3.21(d).
66 Loop transformations in the polyhedral model
tile H
til
e 
V
til
e 
V
tile H
til
e 
V
tile H
til
e 
V
tile H
RC
RC
2LB
LB
RC
2LB
BBLB
BBRC
Figure 3.25: Tree of the IDWT variants listed in Table 3.6.
3.8 Building sequences from subsequences
The locality of the IDWT can be improved in different ways, as shown
by the sequences described in Section 3.7, each dealing with one cause
of the poor locality. Therefore, the combinations of these sequences are
promising when one wants to tackle several causes at the same time.
Although the application of transformation sequences is automated,
still substantial effort of the user was needed to figure out how a pri-
mary sequence could be constructed from elementary transformations,
i.e., those offered by the tool URUK.
The described primary sequences can be combined to form larger
sequences, just by applying the URUK scripts in sequence. Only minor
interventions are needed, e.g., adjusting code labels and other trans-
formation operands (mainly shift offsets), omitting redundant trans-
formation steps (between brackets in Table 3.5), typically taking in the
order of ten minutes all together, much less than for the construction
of the primary sequences. This way a space of variants, e.g., as shown
in Figure 3.25, can be explored in a short time. The abstraction level
is raised from combining elementary transformation steps (Section 3.7)
to combining application specific primary sequences (this section), by
which larger steps in promising directions can be taken (cf. number of
elementary transformation steps in Table 3.6). Below we will discuss
some variants and examine their properties.
In Table 3.6 the temporal locality is indicated by the reuse distances.
Recall that in a cache or local memory with the Least Recently Used
3.8 Building sequences from subsequences 67
Table 3.6: IDWT variants with the order of primary sequences performed.
The number of elementary transformation steps is given for a 3-level IDWT
(k = 3). The outer loop (iterator l) has been unrolled to fit in the polyhedral
model. As a result, many transformation steps occur three times; once for
each level. The number of lines in the generated code is given as generated
by CLooG [26] with (a) and without (b) control optimizations (command line
options “-f 1” and “-f -1”, respectively, cf. example 6). The temporal lo-
cality is indicated by the number of accesses with a reuse distance larger than
a given size (for an image of 72 × 88 pixels). Detailed results are found in
appendix B.
R
C2
LB
Ti
le
H
Ti
le
V
#E
l.
st
e
ps # lines C # Reuse Distances
(a) (b) > 29 > 211 > 213 > 214
RC 0 32 32 10202 8954 4868 489
RC H 1 28 166 55 12731 10981 6035 0
RC HV 1 2 54 456 69 14213 10695 2114 0
RC V 1 19 98 46 11845 9106 2103 0
RC VH 2 1 47 400 57 14267 10801 2120 0
VH LB 3 2 1 62 1887 75 15660 10876 2008 0
LB 1 18 83 56 33646 1881 1119 211
LB H 1 2 40 926 71 14871 13124 6508 0
LB HV 1 2 3 53 2765 73 16617 12456 6748 0
LB V 1 2 27 425 64 34242 4678 144 0
LB VH 1 3 2 52 5639 75 16685 11899 774 0
(LRU) replacement policy, data is retained between reuses if and only
if the corresponding reuse distance is smaller than the cache or buffer
size [42]. Based on this the number of cache misses can be plotted as
a function of the cache size as in Figure 3.26. The detailed measure-
ment results are found in appendix B. They indicate that no variant
outperforms another one for all possible cache sizes.
As already mentioned in Section 3.7, the row-column-based IDWT
(RC) has a bad data locality. An entire frame has to be stored between
the vertical and horizontal filtering. In a line-based variant (LB) this is
reduced to several lines. If the lines are too long to be stored on-chip,
horizontal tiling may help by dividing the lines in smaller parts (LB H).
One can see in Figure 3.26 that the misses of LB are strongly reduced by
68 Loop transformations in the polyhedral model
100
1000
10000
100000
1e+06
0 2 4 6 8 10 12 14 16
N
u
m
be
r
o
fm
is
se
s
log
2
(Cache size)
RC
RC H
RC HV
RC V
RC VH
VH LB
LB
LB H
LB HV
LB V
LB VH
Figure 3.26: Cache misses in function of cache size (assuming LRU) based on
the reuse distance histograms (Appendix B).
increasing the cache size from 210 to 211, i.e. when the cache becomes
large enough to save several lines of an image, while LBH already takes
a step down (be it a smaller one) at a size of 29. However, in most cases
LB H is outperformed by RC H.
If the IDWT is done level by level (RC and LB), each reconstructed
LL-subband has to be stored between the different transformation lev-
els. When the different levels run simultaneously, the buffers between
them can be reduced. This is possible in hardware by implementing
separate designs for each level and let them work in a pipe-line [54], or
in software by placing each level in a separate thread. This is only ben-
eficial if the scan-patterns of the different levels correspond (LB and not
RC). Another way to reduce the data stored between levels (for both LB
and RC) is to interleave the operations of different levels by tiling.
A block-based IDWT [167] can be reached in several ways. First
3.8 Building sequences from subsequences 69
tiling vertically and then horizontally (RC VH, LB VH and VH LB) or
vice versa (RC HV and LB HV). With (BBLB) or without (BBRC) transi-
tion to line-based influences the scanning order within a tile; line-wise
for the former and row-column-wise for the latter. In both cases the
order of scanning the tiles is the same. The order in which the primary
sequences are applied, e.g., LB VH vs. LB HV and VH LB, has an in-
fluence on the resulting variant. This is illustrated by the number of
generated lines of C code, but also reflects in the resulting locality. This
is similar to the influence of the location of the shift in Figure 3.19 and
3.20. In a manual implementation, trying several permutations of the
application of transformations is typically not considered. This demon-
strates the usefulness of the proposed methodology. In a more thor-
ough exploration the tile size should be varied, which is not done here.
The original RC variant performs quite well if little buffer mem-
ory is available. Note that the reuse distance only contains information
about the temporal locality. Other important factors that determine the
performance have to be considered, such as spatial locality (burst mode
usage) and exploitable parallelism. Furthermore, if the IDWT is only
one block in a system it is usually beneficial to adjust the data write
and read pattern of successive modules to each other.
Why horizontal and vertical tiling do not commute
It is easy to understand that tiling before or after going to a line-based
variant leads to different results as mentioned above. However, it is
less obvious where the differences come from by interchanging the or-
der of horizontal and vertical tiling. Therefore, we took a closer look
at the transformation sequences RC HV and RC VH. With some small
changes the sequences can be made equivalent.
The peeling in the horizontal tiling makes that the vertical tiling
works on a different number of loops and statements if it is done before
or after the horizontal tiling. This can be compensated by inserting
three loop fusions (one for each wavelet transform level) at the end
of the RC HV URUK script. Also the sequences LB HV and LB VH
could be transformed into sequences with equivalent results by adding
a series of shifts on two places in the LB HV script.
70 Loop transformations in the polyhedral model
3.9 Comparison with alternative representations
3.9.1 Alternative polyhedral models
As announced in Section 3.4.2, the schedule vector used until now,
defining a multidimensional schedule with dimensions for iterators
and dimensions for the ordering of statements, is only one possible
format of the schedule function.
Early versions of polyhedral models only dealt with perfect loop
nests and had schedule vectors of a dimension equal to the number
of iterators. Applying loop transformations then consisted of applying
linear transformations on the schedule vector (= iteration vector). In
the resulting schedule vector each element is a linear combination of
the iterators (and possibly also parameters).
In the affine 1-dimensional scheduling case [84], the schedule is a
single, i.e. 1-dimensional, linear combination of the iterators and pa-
rameters:
θSx(ISx) = u

 ISxIgp
1

 ,
with u a vector.
In [137] Pouchet et al. suggest that by only considering 1-dimen-
sional affine schedules and adding some bounds on the schedule co-
efficients the search space of legal transformations can be made small
enough for an exhaustive scan. In the examples they give, the search
space consists of in the order of hundreds or thousands of schedules.
For larger programs a heuristic search method is proposed, which first
tries to find the optimal iterator coefficients and then the others. How-
ever, oftenmultidimensional schedules are needed to find legal or more
optimal schedules. The extension of this work to multidimensional
schedules [136, 138] enlarges the search space with many orders of
magnitude. Therefore, a genetic algorithm is used to traverse the opti-
mization spaces, with satisfiable results.
3.9.2 Transformations on ASTs
Loop transformations can also be applied directly on an AST represen-
tation of a program. This avoids the work of extracting iteration do-
mains of statements and simplifies the code generation. A loop strip-
3.9 Comparison with alternative representations 71
mine, e.g., can be done by adding an extra loop node in the tree, a loop
fusion by combining two nodes. The size of the internal representation
is always proportional with the size of the corresponding code. After
a sequence of transformation steps this number can become very large
(cf. Table 3.6). An optimization option like “-f” can not easily be ap-
plied.
3.9.3 Advantages of the polyhedral model
The polyhedral model has several benefits over an AST representation
when performing loop transformations on software as shown in [57, 90,
91].
The application of a long sequence of transformations often leads
to a large increase in code size when using a textual representation
and thus also an AST representation grows rapidly. As can be seen in
Figure 3.19(d), extra code is needed at the borders of the iteration do-
mains. For larger examples or whenmore parameters exist this only be-
comes worse. The polyhedral representation in contrast only increases
slightly. The resulting code sizes in Figure 3.6 illustrate this. This code
increase makes working on ASTs more complex and may inhibit long
transformation sequences.
The polyhedral model allows to, e.g., fuse loops with different
bounds or tile non-perfectly nested loops. Other methods fail to rec-
ognize (by pattern-matching) the opportunity to perform these trans-
formations. For example, in Figure 3.10 a loop interchange of i and j
is not possible because the upper loop bound of j is a function of i. In
the polyhedral model loop bounds are only calculated by the code gen-
eration after the transformations are performed and the interchange is
simply done by switching i and j in the schedule vector.
Often, a long sequence of enabling transformations is needed before
the actual optimizing transformations are possible. The application of
one transformation might produce code that inhibits another transfor-
mation. When using the polyhedral model, code is only generated after
the last transformation. The intermediate representation offers more
flexibility as invocations of statements are not bound to a control struc-
ture.
72 Loop transformations in the polyhedral model
3.10 Conclusions
Loop transformations are indispensable for improving aspects of an al-
gorithm implementation, such as locality and parallelism. A polyhe-
dral representation allows to automate the application of loop transfor-
mations. This makes exploring a part of the loop transformation design
space more feasible. However, to fully automate the choice of transfor-
mations to apply, a lot of research is still needed.
When applying a long sequence of transformations, the order of
transformation steps may have an impact on the final result. We have
demonstrated that by combining subsequences long transformation se-
quences can be constructed more easily. It is an open question if the
adjustments needed to have a correct subsequence combination can
be automated. Some tasks, such as adjusting the vectors over which
statements are shifted, appear likely to be automated. To automate the
adjustment of labels and strip-mine factors probably some deeper un-
derstanding (future research) is needed.
Chapter 4
Bounds on
quasi-polynomials for
static program analysis
To guide loop transformations, some desired program properties, such as lo-
cality or memory usage, have to be evaluated. This can be done by profiling a
program run or by static analysis of the code. In the latter case, the evaluation
problem can often be expressed as counting the number of integer points in a
(parameterized) polyhedron. The result will then be a quasi-polynomial in the
parameters. In some cases, the extremum of such a quasi-polynomial over the
integer points in a polyhedron has to be found. This will be the focus of this
chapter.
Several techniques exist to bound the range of polynomials over continu-
ous domains. The maximal difference between the continuous- and discrete-
domain extrema of polynomials will be studied. This allows to know when the
continuous-domain extrema can be used as a sufficient approximation of the
discrete-domain extrema. Finally, the results will be used to find bounds on
quasi-polynomials, by converting them into (sets of) polynomials.
4.1 An example of static program analysis
As described in Chapter 3, loop transformations can be used to improve
some properties, such as temporal and spatial locality and memory us-
age. However, to guide the transformation process these properties
have to be measured or estimated in some way. This can be done by
74 Bounds on quasi-polynomials for static program analysis
#define S1(i,j) \
a=((j>i) ? A[i][j-1] : 0); \
b=((i>1) ? A[i-1][j] : 0); \
A[i][j]=f(a,b);
for (i=1;i<=N;i++){
for (j=i;j<=N;j++){
S1(i,j);
}
}
i
j
N
N
1
1
(a) (b)
Figure 4.1: (a) Program example. (b) Iteration domain of the example in (a).
The arrows indicate dependences. The dotted line represents a point of exe-
cution of the linear schedule θS1(i, j) = i + 3j. The origins of the arrows cut
by this line correspond to the live set at that point.
profiling or by static analysis. The former gives precise numbers at the
cost of program compilation and running the program for each desired
set of parameter values. The latter allows a more analytic analysis of
the influence of parameters but might not be applicable in all cases.
In many cases the problem of calculating program properties with-
out running the program, but by statically analyzing the code can be
reduced to a polyhedral counting problem (the exact definition follows
in Section 4.2), e.g., counting the number of points in a Z-polytope. In
[40] the calculation of reuse distances is expressed as such a counting
problem. In some cases an extremal value of the solution of a counting
problem is needed, as is demonstrated in the example below.
Consider the program in Figure 4.1(a). The number of array ele-
ments accessed is equal to the number of points in the iteration domain
DS1 = {(i, j) ∈ Z2 | 1 ≤ i ≤ N, i ≤ j ≤ N} , (4.1)
and thus is equal to
|DS1| = N(N + 1)
2
.
However, less memory is needed if an element is only kept in memory
as long as its value is needed by a future statement invocation. The
arrows in Figure 4.1(b) indicate these Read-after-Write dependences. The
4.1 An example of static program analysis 75
set of dependences can be written as
DS1δS1 = {(i1, j1, i2, j2) ∈ D2S1 |
(i2, j2) = (i1, j1) + (0, 1) ∨ (i2, j2) = (i1, j1) + (1, 0)} .
After the execution of S1(1,N), the elements A[1][2] till A[1][N ] are in
the memory. After this iteration, the memory used decreases monoton-
ically. Each time a new element is produced, one or two elements are
consumed for the last time and can be removed. The memory usage is
thus N − 1.
By changing the execution order of the statement invocations the
memory usage can be reduced. We introduce a one-dimensional affine
(or linear) schedule [84], θS1(i, j) = s1i + s2j that defines a new exe-
cution order. In a valid schedule, data should be produced before it is
consumed:
V (s1, s2) : ∀(i1, j1, i2, j2) ∈ DS1δS1 : θS1(i1, j1) < θS1(i2, j2) .
The set of live elements of array A at a certain point t is the set of el-
ements that are written before and read after the point t. Since in this
example every element of array A is only written once and every other
access to that element is a read, this set is equal to the set of elements
that are accessed both before and after the point t, which is expressed
as:
LEA(t) = {(i1, j1) ∈ DS1 | ∃(i2, j2) ∈ DS1 :
(i1, j1, i2, j2) ∈ DS1δS1 ∧ θS1(i1, j1) < t ∧ t ≤ θS1(i2, j2)} .
(4.2)
Here the assumption is made that the output data is immediately re-
moved when ready, and input data is only fetched when it is used for
the first time. In a practical implementation an input and output buffer
will be needed.
The memory size needed for the execution of a valid schedule is the
maximum number of elements in LEA(t) for all values of t, i.e.
max
t
|LEA(t)| . (4.3)
The schedule with the minimal memory requirements corresponds to
min
s1,s2:V (s1,s2)
max
t
|LEA(t)| .
76 Bounds on quasi-polynomials for static program analysis
The focus of this chapter is on finding a way to find the maximum
(or minimum) of the solution of a counting problem (a (piecewise) quasi-
polynomial, see Section 4.2) over a polyhedral domain, or in this exam-
ple (4.3). Note that the value of this extremum is more important than
its location.
4.2 Quasi-polynomials and
polyhedral counting problems
The general form of an Ehrhart quasi-polynomial1 is defined in terms
of periodic numbers.
Definition 12. A rational periodic number u(n) is a function u : Z → Q,
such that u(n) = u(n′) whenever n ≡ n′ (mod d), where d ∈ N is called the
period of u(n).
A periodic number can be represented by an array of rational num-
bers:
u(n) = [u0, u1, . . . , ud−1]n ,
which means
u(n) = un mod d =


u0 when n ≡ 0 (mod d)
u1 when n ≡ 1 (mod d)
...
ud−1 when n ≡ d− 1 (mod d)
.
Alternatively, u(n) can be expressed using fractional parts of linear ex-
pressions. For example,{n
3
}
+
{n
2
}
=
[
0,
5
6
,
2
3
,
1
2
,
1
3
,
7
6
]
n
,
where the notation {.} denotes the fractional part, i.e. {x} = x − ⌊x⌋.
Henceforth, we shall only use the fractional notation. It is straightfor-
ward to convert a fractional notation into an array representation. Since
{
n− 1− i
d
}
−
{
n− i
d
}
=


d−1
d
= 1− 1
d
when (n− i) ∈ dZ ,
i.e. n ≡ i (mod d) ,
−1
d
otherwise ,
(4.4)
1For brevity Ehrhart will be omitted from here on.
4.2 Quasi-polynomials and polyhedral counting problems 77
the opposite can be done, e.g., by using
[u0, u1, . . . , ud−1]n =
d−1∑
i=0
ui
(
1
d
+
{
n− 1− i
d
}
−
{
n− i
d
})
. (4.5)
Definition 13. A quasi-polynomial f(p) of degree g in one variable p is a
polynomial expression in p over the rational periodic numbers, i.e.
f(p) =
g∑
i=0
ui(p)p
i ,
where the ui(p) are periodic numbers.
The definitions of periodic number and quasi-polynomial can be
extended to the multivariate case [162]. A piecewise quasi-polynomial is
a function that is a quasi-polynomial on each element of a partition of
the domain into polyhedral subdomains, called chambers. Our interest
in quasi-polynomials stems from the following theorem.
Theorem 2 (Clauss and Loechner [55]). The number of integer points in a
parameterized polytope Pp of dimension n can be expressed on each chamber
of a partition of the parameter space by a quasi-polynomial of degree n in p.
Here p represents a vector of integer parameters, p ∈ Zm. The parti-
tion into polyhedral chambers originates from the fact that parametric
vertices (Section 3.3.3) may only exist for a subset of the parameter val-
ues. The polyhedral chambers are defined in such a way that in each
chamber a certain set of parametric vertices exist for all parameter val-
ues within that chamber. The quasi-polynomial in that chamber can
then be computed from the corresponding parametric vertices.
More generally, the number of points in the integer projection of a
parameterized Z-polytope, as well as the number of solutions to a so-
called Presburger formula (linear inequalities combined with the logi-
cal operators and quantifiers ∧, ∨, ∀, ∃, ¬) can also be expressed as a
piecewise quasi-polynomial [162, 161].
Several methods can be used to actually compute the number of
integer points in such sets. An overview can be found in [159, 40]. For
the experiments in this chapter we will use the barvinok library [160],
which implements and extends the techniques proposed by Barvi-
nok [25]. This library can generate the solution of a polyhedral counting
problem as a piecewise quasi-polynomial, using either the fractional or
78 Bounds on quasi-polynomials for static program analysis
the array representation of periodic numbers. In our experiments, we
will use the fractional representation, as it is polynomially-sized (for a
fixed number of dimensions n), while the array representation can be
exponentially large (even for fixed dimensions).
For integer polytopes, i.e. polytopeswith integer vertices, the quasi-
polynomial that gives the number of integer points in the polytope as
a function of the (integer) parameters is actually a polynomial [20, 127].
Let us go back to the example in Section 4.1. We consider s1 and
s2 as constants so that LEA(t) is defined by linear constraints. In this
case the number of elements in this set can be described by a piece-
wise quasi-polynomial in the parameters N and t. For example, for
θS1(i, j) = i+ 3j and N = 100 we get
|LEA(t)| =


1 if 5 ≤ t ≤ 6
1
4t+ [−1,−14 ,−12 ,−34 ]t if 7 ≤ t ≤ 301
75 if t = 302
−34t+ [301, 12074 , 6032 , 12054 ]t if 303 ≤ t ≤ 398
−t+ 401 if 399 ≤ t ≤ 400
0 otherwise
. (4.6)
If N is not given a value but is kept as a parameter the expression be-
comes much larger and is therefore not shown here.
4.3 Overview and methodology
An exact way to find the extrema of a (quasi-)polynomial over a dis-
crete domain is evaluating it in every point of the domain. For large
domains this is clearly infeasible. De Loera et al. [62] have shown that
the problem of finding the extrema of arbitrary polynomials over the
integer points in (non-parameterized) polytopes is NP-hard (even for
fixed dimensions).
Several techniques exist that yield bounds on polynomials over
continuous domains (Section 4.4). Continuous-domain extrema can be
used as approximations of the discrete subdomain extrema, provided
that the difference between the two satisfies the accuracy requirements.
In Section 4.5 we study the maximal approximation error. If this error
is too large it can be reduced by partial evaluation of the polynomial for
4.4 Bounds on polynomials in continuous domains 79
the values of a selection of the variables. For example, the polynomial
f(x, y) = −2x2y2 + 6x2y − 2x2 + 6xy2 − 18xy + 6x− 2y2 + 6y (4.7)
over the domain [0, 9] × [0, 2] has (continuous-domain) extrema 139.5
and−108, while the extrema in the discrete subdomain Z2∩ [0, 9]× [0, 2]
are 112 and −108. Partially evaluating f(x, y) for the possible values of
y leads to 3 polynomials in one variable
f0(x) = f(x, 0) = −2x2 + 6x
f1(x) = f(x, 1) = 2x
2 − 6x+ 4
f2(x) = f(x, 2) = 2x
2 − 6x+ 4 ,
with continuous-domain extrema 9/2, −108, and twice 112 and −1/2,
respectively. The continuous-domain extrema now coincide with the
discrete-domain extrema.
Section 4.6 proposes three methods to convert quasi-polynomials
into (sets of) polynomials so that bounds on quasi-polynomials can be
derived. Experiments (Section 4.7) investigate the accuracy and com-
putation cost of the different methods. By combining different meth-
ods hybrid methods can be constructed. This will be discussed in Sec-
tion 4.8.
4.4 Related work: bounding the range of
polynomials in continuous domains
An overview and comparison of methods to find bounds on polynomi-
als over continuous domains can be found in [139, 122, 144].
Interval Arithmetic and its variants study the range of an expression
as a function of the intervals in which the variables of the expression lie.
The intervals are added, subtracted, multiplied, . . . in such a way that
the computed interval is guaranteed to contain the value of the expres-
sion for all values of the variables within the intervals. A problem is
that the correlation between different terms is neglected. For example,
x ∈ [−1, 1] ⇒ (1− x) ∈ [0, 2]
⇒ x+ (1− x) ∈ [−1, 1] + [0, 2] = [−1, 3] .
However,
x+ (1− x) = 1 ,
80 Bounds on quasi-polynomials for static program analysis
for any x. Similar problems occur in the evaluation of polynomials and
can be addressed by not writing a polynomial using the power basis
but using a Bernstein basis, Horner form or centered form [139].
Rivlin’s method [142], extended to the multivariate case by Garloff
[89], samples the polynomial in equidistant points and, based on the
mean value theorem, derives a bound on the difference between the
extrema of the samples and the extrema of the polynomial itself. For a
given polynomial f(x) =
∑g
i=0 aix
i of degree g the following holds on
x ∈ [0, 1]:
min
0≤i≤k
f
(
i
k
)
− αk ≤ f(x) ≤ max
0≤i≤k
f
(
i
k
)
+ αk , (4.8)
where
αk =
1
8k2
g∑
i=0
i(i− 1)|ai| . (4.9)
Note that (4.9) gives a measure for the maximal absolute error for a
given polynomial.
Gopalsamy [92] uses a similar technique. He does not sample the
polynomial in equidistant points but in well chosen points, only de-
pending on the degree of the polynomial.
The Bernstein expansion [144, 89, 56] writes a polynomial as a linear
combination of Bernstein basis polynomials. Themaximal andminimal
Bernstein coefficients provide bounds on the polynomial. This method,
summarized in Appendix C, has the nice property that if the minimum
or maximum is reached at a vertex of the domain, then that bound will
be exact. For example, in Figure 4.2 the lower boundF b on f derived by
Bernstein expansion is exact since the minimum of f over the interval
[0, 2] is reached for x = 0. Bernstein expansion can be used over a
polytope domain [56, 107] and also over parameterized domains [56],
in contrast with most other methods that need a box-shaped domain
with unparameterized borders.
For most techniques (Rivlin, Bernstein) the approximation error can
be reduced by dividing the domain into smaller parts and applying the
technique on each subdomain. For example, in Figure 4.2, the upper
bound F b on the polynomial function f found by Bernstein expansion
over the interval [0, 2] is
F b[0,2] = 5 ,
while subdivision in two parts leads to
max(F b[0,1], F b[1,2]) = max(3, 4) = 4 .
4.5 Cont.- vs. discr.-domain extrema of polynomials 81
F c
Fb[0,1] = Fd
Fb[0,2]
Fb[1,2]
F c = Fd = Fb
0 1 2
x−1
0
1
2
3
4
5
f
(x
)
=
−2
x
2
+
6x
−
1
Figure 4.2: Example polynomial functionwith extrema in the discrete ({0,1,2})
and continuous ([0,2]) domain. The Bernstein expansion gives an upper
bound on the maxima, which becomes more accurate when the interval is
split into smaller parts.
Note that the upper bound in the interval [0, 1] is exact just like the
lower bound mentioned above. When computing the Bernstein coef-
ficients in one subregion, intermediate values for the computation of
the coefficients in neighboring subregions are produced. This allows to
compute the Bernstein coefficients of a subdivision more economically,
as demonstrated for a triangle in [107].
4.5 Continuous- versus discrete-domain extrema
of polynomials
4.5.1 Univariate case
Consider an arbitrary polynomial f(x) =
∑g
i=0 aix
i of degree g. We
will use the notation F d and F d for the maximum and minimum of this
polynomial in the discrete domain Z ∩ [Nmin,Nmax], Nmin,Nmax ∈ Z,
and F c and F c for the extrema in the enclosing continuous interval
[Nmin, Nmax].
2 For Nmin = Nmax these intervals obviously coincide.
Therefore, we restrict ourselves to the caseN = Nmax−Nmin > 0. With-
out loss of generality we will further assume Nmin = 0 and Nmax = N .
2Here F refers to the polynomial f . When another name is used for the polynomial,
e.g., g or h this becomes Gd, Gd, Gc and Gc, or Hd, Hd, Hc and Hc, respectively.
82 Bounds on quasi-polynomials for static program analysis
An example is shown in Figure 4.2.
Definition 14. If one uses the continuous-domain extrema of a polynomial f
as an approximation of the discrete-domain extrema, the relative approxima-
tion error on the range, RE (Relative Error), is defined as
RE =


(F c − F c)− (F d − F d)
F d − F d
when F d 6= F d ;
0 when F d = F d ∧ F c = F c ;
∞ when F d = F d ∧ F c 6= F c .
(4.10)
For the example in Figure 4.2 we get RE = 0.54 = 0.125. Note that,
by definition, F c ≥ F d and F c ≤ F d. As a result, RE is always non-
negative. Note also that it is scale and position invariant, i.e. the poly-
nomial af(x) + bwith a, b ∈ R will yield the same value for RE as f(x)
does. In the cases g = 0 or g = 1, the polynomial is a linear function,
with extrema in 0 and N , which results in RE = 0. For higher degrees
we present the following theorems.
Theorem 3. If g ≥ 2 and N ≥ g then there exists an upper bound on the rel-
ative approximation error RE , caused by using the continuous-domain range
as an approximation of the discrete-domain range, that can be reached for an
arbitrary polynomial of degree g over the interval [0,N ]. Otherwise, no such
bound exists.
Proof. If N < g then a polynomial can be constructed with RE = ∞.
For example, the polynomial h(x) =
∏g−1
i=0 (x− i) has degree g, and for
n ∈ {0, 1, . . . , N} it follows that h(n) = 0 = Hd = Hd, while Hc 6= Hc
and thus RE =∞.
If N ≥ g > 0 then F d 6= F d, for an arbitrary f(x). Indeed, F d =
F d = f(n) = c, for all n ∈ {0, 1, . . . ,N} would imply that f(x)− c is a
polynomial of degree g with N + 1 > g zeroes, which is impossible if
g > 0.
In view of the invariance of RE , we can assume F d = −1 and F d =
1. The polynomial can be rewritten using Lagrange interpolants [61,
143] as
f(x) =
g∑
i=0
f(xi)LS,i(x) ,
4.5 Cont.- vs. discr.-domain extrema of polynomials 83
where theLS,i(x) are degree-g Lagrange interpolants or basis functions,
given by
LS,i(x) =
∏
0≤j≤g,j 6=i
x− xj
xi − xj (4.11)
=
(x− x0)(x− x1) · · · (x− xi−1)(x− xi+1) · · · (x− xg)
(xi − x0)(xi − x1) · · · (xi − xi−1)(xi − xi+1) · · · (xi − xg)
for i ∈ {0, 1, . . . , g}, with S = {x0, . . . , xg} an arbitrary subset of
{0, 1, . . . , N} with g + 1 elements, denoted S ∈ SN , with
SN = {S ⊂ {0, 1, . . . ,N}, |S| = g + 1} . (4.12)
By construction,
LS,i(xj) = δij =
{
1 if i = j ,
0 if i 6= j . (4.13)
By setting F d = −1 and F d = 1we know that
|f(n)| ≤ 1,∀n ∈ S ⊂ {0, 1, . . . ,N}
and that for all x ∈ [0, N ] and for all (N+1
g+1
)
possible choices of S ∈ SN ,
|f(x)| =
∣∣∣∣∣
g∑
i=0
f(xi)LS,i(x)
∣∣∣∣∣
≤
g∑
i=0
|f(xi)||LS,i(x)|
≤
g∑
i=0
|LS,i(x)| .
The sum of the absolute value of the Lagrange basis functions is termed
Lebesgue function [150, 120, 143]. Since S is an arbitrary element of SN ,
|f(x)| ≤ min
S∈SN
g∑
i=0
|LS,i(x)| , MN (x)
≤ max
x∈[0,N ]
(
min
S∈SN
g∑
i=0
|LS,i(x)|
)
, K , (4.14)
84 Bounds on quasi-polynomials for static program analysis
whereK is a function ofN and g only. This results in
RE =
(F c − F c)− 2
2
≤ 2K − 2
2
= K − 1 (4.15)
and thus an upper bound on theRE that can be reached for an arbitrary
polynomial of the given degree g exists.
If g = 2, the polynomial function f can only have one local ex-
tremum and therefore, either F c = F d = 1 or F c = F d = −1. As a
result, the bound on RE in (4.15) can be refined to
RE ≤ (K + 1)− 2
2
=
K − 1
2
. (4.16)
Definition 15. We introduce MRE (g,N) to denote the supremum of RE
over all polynomials of degree g on an interval with size N . MRE(g,N) can
be a real number or infinity.
IfK is written as a function of g and N it follows from (4.15) that
MRE(g,N) ≤ K(g,N) − 1 . (4.17)
One can prove that K(g,N) ≥ K(g,N + 1) and thus K(g,N) is mono-
tonically nonincreasing with N , for a given g. We will prove that
K(g,N) converges to 1 (and thus MRE converges to 0), as N → ∞
(Theorem 4), but first present a lemma needed to prove the theorem.
Lemma 1. The function MN (x) = minS∈SN
∑
i |LS,i(x)| is symmetric
about the axis x = N/2.
Proof. We define linear transformations on sets as
kS + l = {ku+ l | u ∈ S}, for k, l ∈ R ,
with the following property for the Lagrange interpolants:
LkS+l,i(kx+ l) = LS,i(x) . (4.18)
One can easily prove that
S ∈ SN ⇔ S′ = ((−1)S +N) ∈ SN .
From (4.18) it follows that LS′,i(x), with S
′ = (−1)S + N , is the mir-
ror image of LS,i(x) about the axis x = N/2, and thus
∑
i |LS′,i(x)|
is the mirror image of
∑
i |LS,i(x)|. This means that for each function∑
i |LS,i(x)| that is included in the minimization over SN , also the mir-
ror image about x = N/2 is included. As a result, the obtained function,
MN (x) is symmetric about x = N/2.
4.5 Cont.- vs. discr.-domain extrema of polynomials 85
Theorem 4. For a fixed degree g ≥ 2,MRE (g,N), and thus the maximal RE
that can be reached by an arbitrary polynomial, converges to 0 as N increases
indefinitely.
Proof. A similar notation is used as in the previous theorem. An index
N is used to indicate the dependence on N .
KN = max
x∈[0,N ]
MN (x)
It suffices to show that KN converges to 1. This follows from (4.17).
Convergence of KN requires that for an arbitrary ǫ > 0 there exists an
Nm for which 1 − ǫ < KN ′ < 1 + ǫ, for all N ′ ≥ Nm. Note that by
construction 1 ≤ KN ′ and thus 1− ǫ < KN ′ for any ǫ > 0.
If xj ∈ S then ∑
i
|LS,i(xj)| =
∑
i
δij = 1 ,
using (4.13) and thus
MN (n) = 1,∀n ∈ {0, 1, . . . ,N},∀N ≥ g .
Choose an arbitrary value of N , e.g., Ng = g. SinceMNg (x) is a contin-
uous function and sinceMNg (0) = 1,
∃ b ∈ (0, 1] such thatMNg(x) < 1 + ǫ, for x ∈ [0, b] .
SinceMNg(x) is symmetric about x =
Ng
2 , it follows thatMNg (x) < 1+ǫ
for x ∈ [Ng − b,Ng]. Let
a =
⌈
1
b
⌉
, i.e. a ∈ N and ab ≥ 1 ,
then
MNg
(x
a
)
< 1 + ǫ , ∀x ∈ [0, ab] ∪ [aNg − ab, aNg] ,
hence also, since ab ≥ 1,
MNg
(x
a
)
< 1 + ǫ , ∀x ∈ [0, 1] ∪ [aNg − 1, aNg] . (4.19)
Choose
N ′ ≥ 2aNg − 2 = Nm, N ′ ∈ N , (4.20)
86 Bounds on quasi-polynomials for static program analysis
and create SN ′,j ⊂ SN ′ from SNg by transforming the elements in SNg
as follows
SN ′,j = {aS + j | S ∈ SNg}, j = 0, 1, . . . ,N ′ − aNg .
Using (4.18) and (4.19) this results in
min
S∈SN′,j
∑
i
|LS,i(x)| = min
S∈SNg
∑
i
|LaS+j,i(x)|
= min
S∈SNg
∑
i
∣∣∣∣LS,i
(
x− j
a
)∣∣∣∣
= MNg
(
x− j
a
)
< 1 + ǫ, when x ∈ [j, j + 1] or
x ∈ [aNg − 1 + j, aNg + j] .
Knowing that ∪jSN ′,j ⊂ SN ′ , and using (4.20)
MN ′(x) = min
S∈SN′
∑
i
|LS,i(x)|
≤ min
S∈∪jSN′,j
∑
i
|LS,i(x)|
< 1 + ǫ ,
when
x ∈
N ′−aNg⋃
j=0
[j, j + 1] ∪ [aNg − 1 + j, aNg + j] = [0,N ′] .
Hence,KN ′ < 1 + ǫ, and the theorem follows.
The bound (4.14) obtained in the proof of Theorem 3, may be an
overestimate of the real bound. For g = 2 and g = 3 an exact bound can
be obtained. With the above notation we have the following theorem:
Theorem 5. For any polynomial of degree 2, the relative error (RE ) in the
interval [0, N ], N ≥ 2, satisfies
RE ≤
{
1
(N+1)2−1
if N is even
1
N2−1
if N is odd .
4.5 Cont.- vs. discr.-domain extrema of polynomials 87
Proof. Let f(x) = ax2+bx+c. Without loss of generality we can assume
c = 0 (vertical shift does not alter F − F ) and a < 0.
If the maximum of f over R (xt =
−b
2a , f(xt) =
−b2
4a ) lies outside
[0, N ] then RE = 0. Therefore, we assume
0 ≤ −b
2a
= xt ≤ N
and even
N
2
≤ −b
2a
= xt ≤ N ,
as mirroring horizontally about x = N2 and shifting vertically does not
alter RE . This implies
F c = F d = 0, F c =
−b2
4a
RE =
F c − F d
F d
.
Now, RE is maximal for a given F c if F d is minimal. Varying F c by
vertically scaling f(x) does not alter RE . Let −b2a = xt vary between
N
2
and N while keeping F c =
−b2
4a constant and f(0) = 0. This is done by
horizontal scaling. Then
F d = max(f(⌊xt⌋), f(⌈xt⌉))
reaches a local minimum when
f(⌊xt⌋) = f(⌈xt⌉)
or, thanks to the symmetry of f(x) about the axis x = xt,
{xt} = 1
2
.
As f(x) is more flat for lower values of |a| or higher values of xt the
globally minimal F d and maximal RE are found for the smallest xt
that satisfies:
{xt} = 1
2
and
N
2
≤ xt ≤ N .
Thus
xt =
N + 1
2
, F d = f
(
N + 2
2
)
, RE =
1
(N + 1)2 − 1 ,
88 Bounds on quasi-polynomials for static program analysis
Table 4.1: Overview of MRE andK − 1 for several values ofN and g. Figures
rounded to 5 decimals. The values ofK − 1 are plotted in Figure 4.3.
(a) MRE (g,N) (b) K(g,N)− 1
N g = 2 g = 3
2 0.12500 ∞
3 0.12500 0.50343
4 0.04167 0.18808
5 0.04167 0.18808
6 0.02083 0.18808
7 0.02083 0.10017
8 0.01250 0.06257
9 0.01250 0.06257
10 0.00833 0.06257
11 0.00833 0.04289
12 0.00595 0.03127
13 0.00595 0.03127
14 0.00446 0.03127
15 0.00446 0.02382
16 0.00347 0.01876
17 0.00347 0.01876
18 0.00278 0.01876
19 0.00278 0.01516
N g = 3 g = 4 g = 5 g = 6
2 - - - -
3 0.63113 - - -
4 0.29314 1.20782 - -
5 0.18808 0.71234 2.10630 -
6 0.18808 0.41106 1.18573 3.54934
7 0.11482 0.29165 0.87606 2.10933
8 0.08137 0.22147 0.59396 1.33818
9 0.06257 0.17014 0.43558 1.05870
10 0.06257 0.17014 0.34837 0.85179
11 0.04726 0.13640 0.28921 0.65248
12 0.03772 0.10781 0.23577 0.50811
13 0.03127 0.08985 0.19622 0.42642
14 0.03127 0.07590 0.16961 0.36470
15 0.02568 0.06490 0.16358 0.31936
16 0.02171 0.05714 0.14893 0.27609
17 0.01876 0.05714 0.12627 0.23998
18 0.01876 0.04906 0.11088 0.21203
19 0.01612 0.04332 0.09901 0.18901
when N is even and
xt =
N
2
, F d = f
(
N + 1
2
)
, RE =
1
N2 − 1 ,
otherwise.
One can prove that this bound equals K−12 . Since the proof of Theo-
rem 5 provides values of xt for which this bound is reached, the bound
is an exact bound.
For g = 3 the maximum REs can be expressed in a more complex
way. Guidelines for the proof are found in Appendix D. Here only
the maximal RE will be given together with polynomials of the form
f(x) = x3 + bx2 + cwith corresponding values ofN where the MRE is
reached.
We have the following cases:
4.5 Cont.- vs. discr.-domain extrema of polynomials 89
0.001
0.01
0.1
1
10
1 10
K
−
1
N
g = 2
g = 3
g = 4
g = 5
g = 6
Figure 4.3: K − 1 as a function of N for several degrees g (Table 4.1(b)).
• Case N = 4k − 1, k ∈ N:
MRE =
1
18
√
36 k2 + 3
(
12 k2 + 1
)
4 k3 − k − 1
≈ 3
8k2
, for large k
reached by polynomials that satisfy
b = −3
2
N
c =
9
16
N2 − 3
8
N − 7
16
.
90 Bounds on quasi-polynomials for static program analysis
• Case 4k ≤ N ≤ 4k + 2, k ∈ N:
MRE =
2
9
√
9 k2 + 9 k + 3
(
3 k2 + 3 k + 1
)
k (2 k2 + 3 k + 1)
− 1
≈ 1
3k2
, for large k
reached by polynomials that satisfy
b = −6l − 3− 3i
c = 9l2 + (12i + 3)l + 3i2 + 6i− 4 ,
with l ∈ N0, i ∈ {0, 1, 2}, 4l + 4 ≤ N ≤ 4l + 4 + i
or l = 0, i ∈ {1, 2}, 3 + i ≤ N ≤ 4 + i .
Apparently, for N ≤ 19, K(3,N)− 1 = MRE(3,N) if N = 4k +1 or
N = 4k + 2 (Table 4.1 (a) and (b)). This means that for these values of
N the bound on RE defined in (4.15) is an exact bound. We did already
mention that for g = 2 the bound given in (4.16) is exact for all N . The
values ofK − 1 listed in Table 4.1(b) are plotted in Figure 4.3.
We conclude that for large intervals (large N ) the continuous ex-
trema are a good approximation of the discrete extrema and the meth-
ods described in Section 4.4 can be used. For small intervals it is better
to evaluate the polynomial in discrete points, but then the evaluation
cost is small. What is large or small obviously depends on the degree of
the polynomial.
4.5.2 Multivariate case
Theorem 3 can be extended to the multivariate case over box-shaped
domains. Here only the case of two variables is considered, but it can
easily be generalized to more variables.
Theorem 6. If Nx < gx (gx ≥ 2) or Ny < gy (gy ≥ 2) then the RE of an
arbitrary polynomial f(x, y) =
∑gx
i=0
∑gy
j=0 aijx
iyj over [0,Nx] × [0,Ny ] is
not bounded. If Nx ≥ gx and Ny ≥ gy then RE is bounded.
Proof. The scheme of the proof is analogous to the proof of Theorem 3.
A polynomial f(x, y) can be written as a linear combination of La-
grange interpolants:
f(x, y) =
gx∑
i=0
gy∑
j=0
f(xi, yi)LSx,i(x)LSy ,j(y) ,
4.5 Cont.- vs. discr.-domain extrema of polynomials 91
with Sx ∈ Sgx,Nx and Sy ∈ Sgy,Ny , sets of cardinality gx + 1 and gy + 1,
respectively. Without loss of generality we can assume F d = −1 and
F d = 1 and thus
|f(nx, ny)| ≤ 1,∀(nx, ny) ∈ Sx × Sy ⊂ {0, 1, . . . ,Nx} × {0, 1, . . . ,Ny} .
For all (x, y) ∈ [0, Nx]× [0, Ny ] and for all possible Sx and Sy we have
|f(x, y)| ≤
gx∑
i=0
gy∑
j=0
|f(xi, yi)||LSx,i(x)||LSy ,j(y)|
≤ max
x∈[0,Nx]
y∈[0,Ny]

 min
Sgx,Nx
min
Sgy,Ny
gx∑
i=0
gy∑
j=0
|LSx,i(x)||LSy ,j(y)|


≤ max
x∈[0,Nx]
y∈[0,Ny]
(
min
Sgx,Nx
gx∑
i=0
|LSx,i(x)|
) min
Sgy,Ny
gy∑
j=0
|LSy,j(y)|


≤ K(gx, Nx)K(gy,Ny) , Kxy(gx,Nx, gy,Ny) .
The bound on RE now becomes
RE ≤ Kxy(gx, Nx, gy, Ny)− 1 = K(gx,Nx)K(gy,Ny)− 1 . (4.21)
The convergence ofKxy follows from the convergence ofK(gx,Nx)
and K(gy, Ny). Note that MRE(gx,Nx, gy ,Ny) + 1 may differ from
(MRE (gx, Nx) + 1)(MRE (gy,Ny) + 1), e.g., for gx = gy = 2 and Nx =
Ny = 2 we have (MRE(gx,Nx) + 1)(MRE (gy,Ny) + 1) = (9/8)
2 =
81/64 = 1.265625 (Theorem 5), while on this domain the polynomial
f(x, y) in (4.7) has RE = 13/32 = 0.40625 and thus RE + 1 = 45/32 =
90/64 > 81/64.
4.5.3 Optimization strategy
The bounds on RE lead to the following approximation heuristic:
Examine the degrees gi and ranges Ni of the variables of the given
polynomial. Calculate the upper bound on the RE , i.e.∏
i
K(gi,Ni)− 1 .
92 Bounds on quasi-polynomials for static program analysis
Table 4.2: REs of some 2-dimensional examples.
(a) Nx = 2, Ny = 2
RExy MRE(gx,Nx) REx MRE(gy,Ny) REy
f(x, y) 13/32=0.40625 1/8 1/4 1/8 1/4
g(x, y) 17/64=0.265625 1/8 1/8 1/8 1/8
h(x, y) 1/8=0.125 1/8 1/8 0 0
(b) Nx = 9, Ny = 2
RExy MRE(gx,Nx) REx MRE(gy,Ny) REy
f(x, y) 0.125 1/80=0.0125 0 1/8 1/8
g(x, y) 0.130 1/80=0.0125 0.00446 1/8 1/8
h(x, y) 0.00446 1/80=0.0125 0.00446 0 0
If this number does not satisfy the accuracy requirements select the
variable(s) with the largestK and partially evaluate the polynomial for
the discrete values of this(these) variable(s), resulting in a set of poly-
nomials and corresponding domains. The variables with the highest
impact on the RE will typically have a relatively low value of Ni and
thus a low cost of partial evaluation.
Example 7. Consider the polynomials
f(x, y) = −2x2y2 + 6x2y − 2x2 + 6xy2 − 18xy + 6x− 2y2 + 6y
g(x, y) = 4x2y2 − 12x2y − 12xy2 + 36xy
h(x, y) = −2x2y + 6xy
over the domain [0, Nx] × [0,Ny ]. The RE s (RExy) over this domain for
Ny = 2 and two values of Nx are listed in Table 4.2. The polynomials can be
partially evaluated for the discrete values of one variable, as demonstrated on
f in Section 4.3.
Filling in the values of y leads to functions in x, which result in the listed
RExs, and filling in the values of x leads to the RE ys. The indices of RE
indicate the variables that were kept continuous when searching the extrema.
Note that for f and Nx = 2 we find
REx = RE y =
1
4
> MRE(gx,Nx) = MRE(gy,Ny) =
1
8
.
This is possible because F c and F c are found for a different value of y (REx)
or x (RE y).
4.5 Cont.- vs. discr.-domain extrema of polynomials 93
We see that it is more beneficial to evaluate variables with a small range
and/or high degree than other variables. The values of the several MRE (gi,Ni)
give an indication on this.
4.5.4 Comparison with Rivlin’s method
Equations (4.8) and (4.9) also give a bound on the difference between
discrete- and continuous-domain extrema of polynomials. After hor-
izontal scaling a bound on the RE of a polynomial over the domain
[0, N ] can be derived. Finding the discrete- and continuous-domain ex-
trema of f(x) =
∑g
i=0 aix
i for x ∈ [0,N ] is equivalent to finding the
extrema of g(x) = f(Nx) for x ∈ [0, 1], with sampling points xj = jN ,
j ∈ {0, . . . , N}.
g(x) = f(Nx) =
g∑
i=0
aiN
ixi
αN =
1
8N2
g∑
i=1
i(i− 1) |ai|N i
RE rivl =
2αN
F d − F d
≤ 2αN
F c − F c − 2αN
.
For f(x) = x3 + bx2 + cx this becomes
αN =
1
8N2
(2 |b|N2 + 6N3)
=
1
4
(|b|+ 3N) .
For the polynomials with maximal RE listed in Section 4.5.1 this
results in a large overestimate of the RE as shown in Table 4.3. How-
ever, for many polynomials with RE < MRE , RE rivl will be closer to
the real RE than MRE as it uses polynomial specific information. In
general one can use:
RE ≤ min(RE rivl,MRE ) .
with MRE depending on g and N and RE rivl depending on N and the
polynomial coefficients (but the last two).
The method of Rivlin provides a bound on the absolute difference
between continuous- and discrete-domain extrema for a given poly-
nomial, which might be a large overestimate as indicated above. The
94 Bounds on quasi-polynomials for static program analysis
Table 4.3: Comparison of the exact RE (rounded to 4 decimals) with the ap-
proximation RE rivl obtained by using Rivlin’s method for some polynomials
of the form x3 + bx2 + cxwith maximal RE .
N b c RErivl RE RErivlRE
3 -4.5 3.5 2.25 0.5034 4.4694
4 -6 5 0.75 0.1881 3.9878
5 -6 5 0.875 0.1881 4.6524
5 -9 20 1 0.1881 5.3170
6 -9 20 1.125 0.1881 5.9817
7 -10.5 24.5 0.525 0.1002 5.2409
relative error can only be computed after estimation of the extrema. On
the other hand, we provide a bound on the relative difference for an
arbitrary (worst-case) polynomial, which is a large overestimate of the
RE when RE ≪ MRE (best-case polynomial), but can be used without
knowledge of the polynomial.
In practical cases a larger absolute error can be tolerated on large
numbers than on small numbers. This indicates that a bound on the
relative error is a better measure than a bound on the absolute error.
4.6 Converting quasi-polynomials
into polynomials
4.6.1 Concept
The previous section describes the case in which the continuous-
domain extrema can be used as an approximation of the discrete-
domain extrema of a polynomial. By converting quasi-polynomials
into polynomials these results can be used to find bounds on quasi-
polynomials.
For simplicity, we will only present the techniques for quasi-poly-
nomials in one variable in a single domain, i.e. an interval. Our tech-
niques aim at transforming a quasi-polynomial into one or more poly-
nomials, possibly with additional variables. These arise from the elim-
4.6 Converting quasi-polynomials into polynomials 95
Table 4.4: General overview of methods to eliminate fractional expressions.
Method New Substitute Constraints Subdomains
Var {an−r
d
} with with constant
(a) Add Var q q
d
0 ≤ q ≤ d− 1
(b) Mod
Classes
q
q
d
0 ≤ q ≤ d− 1, q ∈ Z,
adjust range n q
(c) Add Var k an− r − kd
d
0 ≤ an−r−kd ≤ d−1
(d) Split
Periods
k
an− r − kd
d
0 ≤ an−r−kd ≤ d−1,
k ∈ Z k
(e) Exact value of {an−r
d
} n
ination of the fractional expressions of the form3 {an−r
d
} in the quasi-
polynomial, where n is the free integer variable and a, d and r are inte-
ger constants (d > 0). In the discrete points of the domain the obtained
polynomials will have the same value as the original quasi-polynomial
in the corresponding points of its domain.
To eliminate the fractional expression {an−r
d
}, consider the follow-
ing. For given d, an integer number an − r can be written in a unique
way as
an− r = kd+ q , (4.22)
with k ∈ Z, q ∈ Z ∩ [0, d− 1] . (4.23)
This corresponds to the definition of modulo reduction and integer di-
vision:
q = (an− r) mod d (4.24)
k =
⌊
an− r
d
⌋
. (4.25)
3This form is used in (4.5) (with a = 1) and thus does not result in a loss of general-
ity.
96 Bounds on quasi-polynomials for static program analysis
Hence we can write{
an− r
d
}
=
q
d
(4.26)
=
an− r
d
−
⌊
an− r
d
⌋
=
an− r
d
− k . (4.27)
We eliminate the fractional expression {an−r
d
} by replacing it by simple
arithmetic expressions containing an additional variable q or k. Several
possibilities how to do this are shown in Table 4.4. In the polynomials
so obtained, the variables (n, q) or (n, k) are constrained by the inequal-
ities
0 ≤ q ≤ d− 1 (4.28)
or
0 ≤ an− r − kd ≤ d− 1 . (4.29)
We are looking for the extremal values of this polynomial for all discrete
values in the domain defined by the range of n and (4.28) or (4.29), re-
spectively. These extremal values can be approximated by the extremal
values over the continuous domain defined by the range of n and (4.28)
or (4.29) in which the variables (n, q) or (n, k) are real-valued. As shown
in Section 4.5, more accurate results can be obtained by restricting one
of the variables (e.g., q or k) to integer values, leading to different poly-
nomials on different (real) subdomains.
4.6.2 Overview of methods with examples
The simplest method to eliminate a fractional expression is to replace it
with q/d (inspired by (4.26) ), where q is a free variable within the (real)
range [0, d−1] (Add Var, Table 4.4(a)). The correlation between n and the
fractional is then ignored (q is not bound to n by an equation like (4.24)
and can even have non-integer values). More accuracy is obtained by
splitting the domain into d subdomains with a constant integer value
of q (0, 1, . . . , d − 1) (Mod Classes, Table 4.4(b)). In each subdomain the
(continuous) range of n is adjusted to the value of q (dashed lines in
Figure 4.4(a)) :
n ∈ [minDq,maxDq] ,
with,
Dq = {m ∈ [0, N ] : (am− r) ≡ q (mod d)} .
Dq is the intersection of the domain of n with the residue class corre-
sponding to the value of q.
4.6 Converting quasi-polynomials into polynomials 97
0
1
2
q
0 1 2 3 97 98 99 100
n
p0
p1
p2
≈
≈
≈
≈
0
1
2
k
0 1 2 3 4 5 6 7 8 9
n
l0
l1
l2
(a) n ∈ [0, 100], d = 3, (b) n ∈ [0, 9], d = 4,
a = 2, r = 1 a = 1, r = 0
Figure 4.4: Domains represented in the (n, q) and (n, k) space. The dots cor-
respond with the (exact) discrete domain. The shaded areas represent the
continuous domains of Add Var. The dashed lines indicate the subdomains
obtained byMod Classes (a) and Split Periods (b), respectively.
Example 8. (small period, large domain)
f(n) = n+ [4, 2, 0]n = n+ 6
{
2n − 1
3
}
0 ≤ n ≤ 100
By replacing
{
2n−1
3
}
with q3 this quasi-polynomial can be converted into a
polynomial:
g(n, q) = n+ 2q 0 ≤ n ≤ 100, 0 ≤ q ≤ 2 .
Note that
f(n) = g(n, (2n − 1) mod 3) .
The dots in Figure 4.4(a) have coordinates that satisfy (n, q) = (n, (2n −
1) mod 3) and thus correspond to points of the quasi-polynomial f . As a
result, F d ≤ Gd ≤ Gc. The continuous-domain maximum of g, Gc = 104, is
thus an upper bound on F d.
The polynomial g can be split into 3 polynomials, with corresponding do-
mains (using modulo classes) (Figure 4.4(a)):
p0(n) = n 2 ≤ n ≤ 98
p1(n) = n+ 2 1 ≤ n ≤ 100
p2(n) = n+ 4 0 ≤ n ≤ 99 .
The continuous-domain maximum now coincides with the discrete-domain
maximum with a value of 103. This is also the maximum of the original
quasi-polynomial f . Note that not all discrete points in the domains of the
pi correspond with a discrete point in the domain of f . This can be solved by
horizontally scaling the domain with a factor 1/3 (possibly after a translation),
but does not matter if the continuous-domain extrema are used.
98 Bounds on quasi-polynomials for static program analysis
0
1
2
98
99
100
q
0 1 2 98 99 100
n
p0
≈
p1
p2
p98
p99
≈
≈
. .
.
0
1
2
k
0 1 2 98 99 100
n
l0 ≈
l1
≈
(a) (b)
Figure 4.5: Domains of example 9 represented in the (n,q) and (n,k) space. The
dots correspond with the (exact) discrete domain. The dashed lines indicate
the continuous subdomains. A dashed circle corresponds to a single point.
When d is large splitting intomodulo classes leads tomany domains
(high cost) each with only a few points of the exact discrete domain (a
distance d apart), probably with a large approximation error4. Instead
of introducing a variable q one can introduce a variable k and replace
the fractionals with (4.27) (Add Var, Table 4.4(c)). It can be seen that
both Add Var methods are equivalent, by using q = an − r − kd. The
accuracy is now improved by splitting the domain into subdomains
with constant integer k (Split Periods, Table 4.4(d), Figure 4.4(b)).
Example 9. (large period, small domain)
f(n) =
51
50
n− n
2
100
+2
{ n
100
}
n−2
{ n
100
}
−200
{ n
100
}2
0 ≤ n ≤ 100
In the same way as in the previous example we can define
pi =
51
50
n− n
2
100
+
ni
50
− i
50
− i
2
50
.
This leads to 99 domains with a single element (i ∈ [1, 99]) and the domain
n ∈ [0, 100] for i = 0 (Figure 4.5(a)). The computational effort is roughly the
same as evaluating the function in all integer points.
Fortunately, the fractional expression can be eliminated in a better way by
the substitution{ n
100
}
=
n− 100k
100
with 0 ≤ n− 100k
100
≤ 99
100
.
4Scaling with a factor 1/d allows to use the results of Section 4.5.
4.6 Converting quasi-polynomials into polynomials 99
Table 4.5: Maxima using different methods on several examples.
Ex. 8 Ex. 9 Ex. 10 Ex. 11
Exact 103 25 2 98020000
Modulo Classes 103 26 12 98020000
Split Periods 103 25 20 98020000
Add Var 104 51 120 99020000
The domain of n is split into parts with constant integer k, called periods.
This results in
f(n) =
{
l0(n) = n− n2100 if 0 ≤ n ≤ 99
l1(n) = − n2100 + 3n − 198 = 2 if n = 100 .
Figure 4.5(b) shows the domains in the (k,n) space.
In example 8 this method would result in a lot of overhead. 68 domains
([0], [1], [2, 3], [4], [5, 6], . . . , [97], [98, 99], [100] for k = −1, 0, 1, 2, 3, . . . ,
64, 65, 66) with a small size (1 or 2). This becomes better after the substitution{
2n−1
3
}
= 1 − {n−23 }, by which an expression with a = 1 can be obtained.5
Still this method would lead to 34 subdomains ([0, 1], [2, 4], . . . , [98, 100] for
k = −1, 0, . . . , 32) with a size of 2 or 3. Remember that small subdomains
may lead to large approximation errors when the continuous-domain extrema
are used (cf. Section 4.5). However, the error can not be larger than in the case
no evaluation for the discrete values of k is done.
Exact (Table 4.4(e)) evaluates the quasi-polynomial in every point of
the discrete domain.
Example 10. (small period, small domain)
f(n) = 1440
{n
2
}2
n2 − 5760
{n
2
}3
n+ 5760
{n
2
}4
+ 2880
{n
2
}3
−3208
{n
2
}2 − 720{n
2
}
n2 + 1480
{n
2
}
n
+104
{n
2
}
− 10n2 + 20n+ 2 0 ≤ n ≤ 2
Evaluation in discrete points (f(0) = f(1) = f(2) = 2 = F d), gives the exact
extrema with little cost. The other methods lead to bad results (Table 4.5).
5A smaller value of a always results in less domains. One can always make a ≤ d
2
.
100 Bounds on quasi-polynomials for static program analysis
0
1
2
98
99
100
q
−1 0 1 2 98 99 100
k
≈
≈ ≈ ≈ ≈ ≈
≈
≈
0
1
2
98
99
100
q
−1 0 1 2 98 99 100
k
p0
p1
p3
≈
≈ ≈
p2
≈ ≈
≈
≈
(a) (b)
Figure 4.6: Domain (a parallelogram) of example 11 represented in the (k,q)-
space with n = 100k + q + 2, 0 ≤ q ≤ 99. In (b) the domain is split into a
rectangle, one line segment and two points. Note that only for points with
integer values for both k and q, we can say k =
⌊
n−2
100
⌋
, q = n− 2 mod 100.
Example 11. (large period, large domain)
f(n) = n2
{
n− 2
100
}
+ 2n 0 ≤ n ≤ 10000
Mod Classes and Split Periods both find the exact maximum F d =
98020000 = f(10000), by splitting the domain into 100modulo classes or 101
periods, respectively. Add Var makes a little error of 1%, F c = 99020000.
The domain is shown in Figure 4.6(a), using the (k,q)-space (q = n− r −
kd) for reasons of compactness. It is bounded by the inequalities
0 ≤ q ≤ 99, 0 ≤ 100k + q + 2 ≤ 10000 .
This is a parallelogram, which may not be apparent from the figure due to the
broken axes, with vertices (-0.02,0), (99.98,0), (-1.01,99) and (98.99,99).
The introduction of k makes it possible to increase the accuracy by evalu-
ating k in discrete values, but it is not necessary to divide the domain into 101
parts. In this example it suffices to split the domain into the main rectangle,
a line segment and two points at the borders (Figure 4.6(b)). By doing this,
the region on the top right, which caused the overestimate, is removed and we
obtain F c = F d. This can be done in the (n, k)-space but this would be harder
to visualize here.
Multiple fractional expressions can be addressed by introducing
multiple variables (qi or ki). Now, the decision to evaluate a qi or ki in
4.6 Converting quasi-polynomials into polynomials 101
discrete values can be made for the different variables independently,
with a trade-off between accuracy and computation time. One could
roughly say that a small denominator, d, results in a small range of q,
and a small domain, compared to d, results in a small range of k.
Exact and Split Periods can only be used for (iteration over) fixed do-
mains. The other two allow parametric domains. Bernstein expansion
(Section 4.4, Appendix C) can be used to compute parametric bounds
on parametric polynomials over parametric continuous domains.
4.6.3 Other methods
In the experiments the techniques described above will be compared
with two other methods that address the problem of converting quasi-
polynomials into polynomials.
Poly Approx
In Section 4.2 it was mentioned that for integer polytopes the quasi-
polynomial is actually a polynomial. In [127] B. Meister studies the pe-
riodicity of coefficients in quasi-polynomials. This results in methods
to transform a polytope into an integer polytope, i.e. with its number
of integer points expressed as a regular polynomial. More in general, a
counting problem with a quasi-polynomial as solution can be approxi-
mated by a similar counting problem that has a polynomial as solution.
In the experiments we will use the “orthogonal expansion with in-
flation before expansion” technique presented in [128].
Drop Frac
In this method each fractional expression that appears in the coefficient
of a monomial is replaced by either 0 or (d− 1)/d (its extremal values),
depending on the sign of the monomial. If needed, the subdomains
are divided into orthants,6 such that all monomials have constant signs
within a subdomain. For example,
−34n ≤
{
n
5
}
n2 − {n−24 }n+ {n+13 } ≤ 45n2 + 23 for n ≥ 0 ,
0 ≤ {n5}n2 − {n−24 }n+ {n+13 } ≤ 45n2 − 34n+ 23 for n ≤ 0 ,
6An orthant is the higher dimensional analogue of a quadrant in the plane.
102 Bounds on quasi-polynomials for static program analysis
since n2 and 1 are always positive (or zero) and the sign of −n is oppo-
site to the sign of n.
4.6.4 Implementation
The transformations of quasi-polynomials into polynomials according
to Table 4.4 and the methods explained in Section 4.6.3 have been im-
plemented (for the multivariate case) in (or using) the barvinok li-
brary [160]. The command line interface and options that correspond
with the methods used in the experiments can be found in Appendix E.
To improve the accuracy of the Mod Classes method, polynomials
with subdomains with less than four integer values are evaluated in
discrete points.
Split Period is called with a threshold parameter, Th . If the number
of values a variable ki can reach is larger than Th , no subdomains are
created for each value of ki and the method is the same as Add Var (for
this ki only). This provides a trade-off between accuracy and computa-
tion time.
Incremental Bernstein expansion (see Appendix C) is used to find
the extrema of (parametric) polynomials over continuous (parametric)
domains. This means that an approximation error caused by the dif-
ference between the maximal (minimal) Bernstein coefficient and the
continuous-domain maximum (minimum) of the polynomial is added
to the error caused by the difference between the continuous- and
discrete-domain extrema. This extra approximation error is in many
cases larger than the error from the transition to continuous-domain
extrema. For example 10, the results listed in Table 4.5 would be 38
instead of 20 for Split Periods and even 284 instead of 120 for Add Var.
Until now the same method is applied on all fractional expres-
sions of one piecewise quasi-polynomial. A hybrid method could
be constructed that would decide at run-time for each fractional ex-
pression and in each chamber independently which method to use.
This decision could be based on the denominator of the fractionals
and the domain size. In Section 4.8 simple hybrid methods will be
presented, which choose in each chamber independently between the
Exactmethod and another method, based on the domain size.
4.7 Memory size estimation experiments 103
For n ∈ {L− 1, . . . , N − 1}:
cn = (a ∗ b)n
=
L−1∑
k=0
an−kbk .
(a)
for (i=L-1;i<=N-1;i++){
for (k=0;k<=L-1;k++){
C[i]=C[i]+A[i-k]*B[k]; // S1(i,k)
}
}
(b)
Figure 4.7: Definition (a) and corresponding C code (b) of a 1-D FIR filter.
4.7 Memory size estimation experiments
As simple test cases we study the memory usage of a one-dimensional
FIR-filter and a matrix multiplication. Incremental Bernstein expansion
(see Appendix C) is used to find the extrema of (parametric) polynomi-
als over continuous (parametric) domains.
All measurements are performed on a PC with an AMD Athlon XP
Processor running at 1830 MHz.
4.7.1 The 1-D FIR-filter test case
The input signalAwith lengthN is filteredwith the filterB with length
L, as shown in the code fragment in Figure 4.7. Similar to the example
in Section 4.1 we will use extrema of quasi-polynomials to examine the
memory usage of several affine schedules of this program.
The iteration domain of the assignment statement S1 is given by:
DS1 = {(i, k) ∈ Z2 | L− 1 ≤ i ≤ N − 1, 0 ≤ k ≤ L− 1} .
1-Dimensional schedule
We now consider affine 1-dimensional schedules of the iteration do-
main θS1(i, k):
S1(i1, k1) is executed before S1(i2, k2) iff θS1(i1, k1) < θS1(i2, k2) .
(4.30)
104 Bounds on quasi-polynomials for static program analysis
Table 4.6: Experiments each estimating the memory usage of a 1-D FIR-filter,
for 20 different 1-D schedules, defined by the vector (u, v). L = 50, N = 100.
(a) Experiment over 20 schedule vectors (u, v), with u ∈ Z0 ∩ [−10, 10], v = 1.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 1.3 0.06 99 197.00 0.00
Add Var 1.5 0.08 99 197.00 5.06
Mod Classes 2.3 0.11 99 197.00 0.22
Split Periods Th=10 1.4 0.07 99 197.00 2.62
Split Periods Th=100 3.7 0.19 99 197.00 0.22
Poly Approx 1.0 0.05 99 198.80 74.42
Drop Frac 0.9 0.05 99 2477.22 1079.81
(b) Experiment over 20 schedule vectors (u, v), with u = 1, v ∈ Z0 ∩ [−10, 10].
Method T(s) T/Sched.(s) Min Max RMSE
Exact 1.3 0.06 100 197.00 0.00
Add Var 1.8 0.09 100 197.00 5.34
Mod Classes 2.3 0.12 100 197.00 0.00
Split Periods Th=10 1.7 0.08 100 197.00 3.01
Split Periods Th=100 4.5 0.23 100 197.00 0.06
Poly Approx 1.0 0.05 100 202.80 76.52
Drop Frac 0.9 0.05 100 1410.89 715.47
Similar to (4.2), the set of live elements of array A at an execution point
t is the set of elements that is accessed both before and after the point t:
LEA(t) = {j ∈ Z ∩ [0,N − 1] | ∃(i1, k1), (i2, k2) ∈ DS1 :
j = i1 − k1 = i2 − k2 ∧
θS1(i1, k1) < t ∧ t ≤ θS1(i2, k2)} .
(4.31)
The live elements of array B and C , i.e. LEB(t) and LEC(t), can be
defined similarly. According to Section 4.2 the number of live elements
of A, i.e. |LEA(t)|, can be expressed as a piecewise quasi-polynomial in
t. The required memory for a schedule is the maximal number of live
elements:
max
t
(|LEa(t)|+ |LE b(t)|+ |LE c(t)|) .
In the experiments we use one-dimensional schedules of the form
θS1(i, k) = ui+ vk.
Tables 4.6 and 4.7 list the results of the measurements. All quasi-
polynomials are of degree 2. The experiments each estimate the mem-
ory usage of a 1-D FIR-filter, for 20 different 1-D schedules, defined by
4.7 Memory size estimation experiments 105
Table 4.7: Experiments each estimating the memory usage of a 1-D FIR-filter,
for 20 different 1-D schedules, defined by the vector (u, v). L = 50, N = 1000.
(a) Experiment over 20 schedule vectors (u, v), with u ∈ Z0 ∩ [−10, 10], v = 1.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 11.0 0.55 99 197.00 0.00
Add Var 1.5 0.08 99 197.00 5.12
Mod Classes 2.3 0.11 99 197.00 0.32
Split Periods Th=10 1.4 0.07 99 197.00 2.65
Split Periods Th=100 1.7 0.08 99 197.00 2.65
Poly Approx 1.0 0.04 99 198.80 74.48
Drop Frac 0.9 0.05 99 24077.22 10113.59
(b) Experiment over 20 schedule vectors (u, v), with u = 1, v ∈ Z0 ∩ [−10, 10].
Method T(s) T/Sched.(s) Min Max RMSE
Exact 3.0 0.15 100 1079 0
Add Var 1.1 0.06 100 1080.15 0.99
Mod Classes 2.0 0.10 100 1079 0
Split Periods Th=10 1.0 0.05 100 1080.15 0.94
Split Periods Th=100 4.9 0.24 100 1079 0
Poly Approx 1.0 0.05 100 1961 525.51
Drop Frac 0.9 0.04 100 2877.9 1042.92
the vector (u, v). The time, T , to compute the estimates includes both
the generation and the bounding of the quasi-polynomials, summed
over all 20 schedules. Min and Max are the extremal results (memory
requirements of the schedules with the best and worst memory usage,
respectively) over the 20 schedules and RMSE is the root mean square
error over the 20 schedules of the results compared with the exact solu-
tions.
Timing results for experiments with larger values of N (Figure 4.8)
illustrate that the execution time of the Exactmethod is proportional to
the domain size.
2-Dimensional schedule
A 2-D schedule is described by a 2× 2matrixM :
θS1(i, k) =M
[
i
k
]
.
106 Bounds on quasi-polynomials for static program analysis
1
10
100
1000
100 1000 10000 100000
T(
s)
N
Add Var v = 1
Add Var u = 1
Exact v = 1
Exact u = 1
Figure 4.8: Time used for estimating the memory usage of a 1-D FIR filter as a
function of N for 20 1-D schedules (cf. Table 4.6 and 4.7). L = 50.
The ordering relations < and ≤ in (4.30) and (4.31) have to be re-
placed with the lexicographical ordering ≺ and , respectively. They
can be transformed into ordinary inequalities by using Presburger for-
mulas. For example,
(i1, j1)  (i2, i2)⇔ i1 < i2 ∨ (i1 = i2 ∧ j1 ≤ j2) .
Table 4.8 lists the results of 2 experiments. Note that the schedules in
Table 4.8(a) are a subset of the schedules in Table 4.8(b).
A note on dealing with commutativity
Since the addition is commutative the products after the summation
sign
∑
in Figure 4.7(a) can be calculated and summed in any order (cf.
4.7 Memory size estimation experiments 107
Table 4.8: Experiments each estimating the memory usage of a 1-D FIR-filter,
for 2-D schedules, defined by the matrixM . L = 50,N = 100.
(a) Experiment over 48 schedules: M ∈ (Z ∩ [−1, 1])2×2 and det (M) 6= 0.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 20.2 0.42 99 197.0 0.00
Add Var 2.6 0.05 99 198.0 0.31
Mod Classes 3.1 0.06 99 197.0 0.00
Split Periods Th=10 2.6 0.05 99 197.0 0.00
Split Periods Th=100 3.6 0.08 99 197.0 0.00
Poly Approx 2.9 0.06 99 198.5 0.53
Drop Frac 3.7 0.08 99 198.5 0.48
(b) Experiment over 496 schedules: M ∈ (Z ∩ [−2, 2])2×2 and det (M) 6= 0.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 500.3 1.01 99 197.00 0.00
Add Var 90.1 0.18 99 221.25 19.78
Mod Classes 148.7 0.30 99 197.00 0.01
Split Periods Th=10 82.8 0.17 99 218.00 19.02
Split Periods Th=100 1247.2 2.51 99 221.56 2.75
Poly Approx 84.4 0.17 99 272.00 30.73
Drop Frac 86.9 0.18 99 343.25 44.74
Section 3.4.3). As a result, all schedules are legal, as long as invocations
with a same time stamp (θS1(i1, k1) = θS1(i2, k2)) do not write to the
same array element. In the experiments this is ensured by v 6= 0 and
det(M) 6= 0. The definitions of live elements, as in (4.31), are in ac-
cordance with this commutativity. It states that θS1(i1, k1) < θS1(i2, k2)
but it is not needed that (i1, k1) ≺ (i2, k2). Writing the code in a sin-
gle assignment form would add artificial dependences by ignoring the
commutativity of the addition. Therefore, this is not done here.
Suppose a statement invocation S2(j) reads the value of C[L− 1 +
j]. Since it is not known which invocation of S1 will produce the final
value of C[L − 1 + j], all invocations that write to this element should
be executed before the read by S2(j):
DS1δS2 = {(i, k, j) | (i, k) ∈ DS1, j ∈ DS2, : i = L− 1 + j} .
Since we work with linear schedules it suffices to check the depen-
dences for which k = 0 or k = L − 1. In the schedules considered
108 Bounds on quasi-polynomials for static program analysis
here it holds that
∀k ∈ {0, . . . , L− 1} : θS1(i, k) ≺ θS2(j)
m (4.32)
θS1(i, 0) ≺ θS2(j) ∧ θS1(i, L− 1) ≺ θS2(j) .
For the next example, matrix multiplication, the commutativity of the
addition is dealt with in a similar way.
Discussion of results
The results in Tables 4.6, 4.7 and 4.8 indicate there is a trade-off be-
tween accuracy and computation time. The time needed by the Exact
method, which iterates over all points, is proportional to the domain
size (Figure 4.8). Also Split Periods is influenced by this size, but in a
different way depending on the threshold. Typically, the error added
by the Bernstein expansion is larger than the error introduced by the
transition from discrete to continuous domains. Subdivision in smaller
domains may improve on this. Drop Frac gives the worst results. Each
fractional expression is set to 0 or (d − 1)/d. In Add Var each fractional
can also vary between 0 and (d − 1)/d, but here all fractionals of the
same linear expression will have the same value as they are connected
by one variable. Detailed results, not shown in the tables, indicate that
the Mod Classes method performs worse for larger periods of the peri-
odic numbers (denominators inside fractionals) and Add Var performs
relatively better for larger domains.
In the 2-D schedules the Exact method is much slower than for the
1-D schedules. This is caused by the fact that the iteration domains in
the 2-D case contain more points than in the 1-D case. In the former
each statement invocation occurs at a different value of the schedule
function θS1 (since det(M) 6= 0), while in the latter several invocations
are projected on the same logical execution time.
The 20 piecewise quasi-polynomials of the experiment in Table
4.7(a), have domains composed of in total 160 polynomial chambers.
Their sizes vary a lot (Figure 4.9). 140 chambers contain less than 64
points, while 20 chambers contain more than 512 points. However,
those 20 chambers contain 103326 points of the domains while the 140
small chambers only have 2244 points all together. As a result the ex-
ecution time of the Exact method is dominated by only 20 of the 160
chambers. In this example a hybrid method, using the Exactmethod for
4.7 Memory size estimation experiments 109
 0
 10
 20
 30
 40
 50
1 2 4 8 16 32 64 128
256
512
1024
2048
4096
8192
N
um
be
r o
f C
ha
m
be
rs
 in
 C
la
ss
Chamber size
Figure 4.9: Histogram of the chamber sizes for the 20 schedules of the 1-D FIR
filter considered in Table 4.7(a).
the small chambers and another method for the large chambers, could
be very beneficial, assuming that deciding if a domain is small or large
does not introduce a high extra cost. This will be tested in Section 4.8.
4.7.2 The matrix multiplication test case
Let A ∈ RM×L, B ∈ RL×N , C ∈ RM×N . Then, the matrix product
C = BA is defined as
cij =
L∑
k=1
aikbkj, for i ∈ {1, . . . ,M}, j ∈ {1, . . . ,N} .
CorrespondingC code is found in Figure 4.10. In (a) arrayC is assumed
to be initialized to 0, as is done by memory allocations in several pro-
gramming languages (not in C). In (b) a statement is added to do the
initialization.
110 Bounds on quasi-polynomials for static program analysis
for (i=1;i<=M;i++){
for (j=1;j<=N;j++){
for (k=1;k<=L;k++){
C[i][j] += A[i][k]*B[k][j]; // S2(i,j,k)
}
}
}
(a)
for (i=1;i<=M;i++){
for (j=1;j<=N;j++){
C[i][j] = 0; // S1(i,j)
for (k=1;k<=L;k++){
C[i][j] += A[i][k]*B[k][j]; // S2(i,j,k)
}
}
}
(b)
Figure 4.10: C code of a matrix multiplication. In (b) the initialization of array
C is made explicit.
Without initialization statement S1
We define 1-dimensional linear schedules of the form θS2(i, j, k) = ui+
vj + wk. The set of live elements of array A can be expressed as
LEA(t) = {(i, k) ∈ Z2 ∩ [1,M ]× [1, L] |
(θS2(i, 1, k) < t ∨ θS2(i, L, k) < t)
∧ (θS2(i, 1, k) ≥ t ∨ θS2(i, L, k) ≥ t)} ,
in which an existential variable j has been avoided by using a prop-
erty similar to (4.32). As already mentioned in Section 4.7.1, there are
little dependences that restrict the set of legal schedules, thanks to the
commutativity of the addition.
Table 4.9 lists the results of several sets of schedules. The domain
size, proportional to N , influences the computation time of several
methods. The time needed by Exact is of course proportional to the
domain size. The threshold of the Split Periodmethod offers a trade-off
between accuracy and computation cost. The accuracies ofMod Classes
and Split Periods are always between those of Exact and Add Var. Mod
Classes becomes faster whenN increases since less modulo classes have
less than 4 elements and are evaluated in all points. The time used by
Add Var, Poly Approx and Drop Frac is independent of the domain size.
4.7 Memory size estimation experiments 111
 160
 180
 200
 220
 240
 260
 280
 300
 0  20  40  60  80  100  120  140  160  180  200
M
em
or
y 
siz
e
Schedule
N=10
Exact
AddVar
ModClasses
Split Th=10
Split Th=100
PolyApprox
DropFrac
(a)
 16000
 18000
 20000
 22000
 24000
 26000
 28000
 30000
 0  20  40  60  80  100  120  140  160  180  200
M
em
or
y 
siz
e
Schedule
N=100
Exact
AddVar
ModClasses
Split Th=10
Split Th=100
PolyApprox
DropFrac
(b)
Figure 4.11: Memory size requirements of a matrix multiplication without S1
as a function of the schedule θS2. u, v, w ∈ Z0 ∩ [−3, 3], gcd(u, v, w) = 1.
112 Bounds on quasi-polynomials for static program analysis
Table 4.9: Experiments each estimating the memory usage of a Matrix Mul-
tiplication without initialization statement S1. The 1-D schedules are defined
by the schedule vector (u, v, w).
(a) Experiment over 200 schedules, with u, v, w ∈ Z0 ∩ [−3, 3],
gcd(u, v, w) = 1, N = 10.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 10.8 0.05 160.00 210.0 0.00
Add Var 15.0 0.08 161.00 285.0 17.92
Mod Classes 28.4 0.14 160.00 285.0 14.67
Split Periods Th=10 19.5 0.10 160.00 285.0 10.13
Split Periods Th=100 19.5 0.10 160.00 285.0 10.13
Poly Approx 14.2 0.07 173.33 285.0 28.99
Drop Frac 15.3 0.08 162.67 285.5 20.10
(b) Experiment over 200 schedules, with u, v, w ∈ Z0 ∩ [−3, 3],
gcd(u, v, w) = 1, N = 100.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 27.9 0.14 16600.00 22350 0.00
Add Var 15.0 0.08 16601.00 29850 2459.26
Mod Classes 19.0 0.10 16600.00 29850 2420.90
Split Periods Th=10 14.8 0.07 16601.00 29850 2459.30
Split Periods Th=100 121.7 0.61 16600.00 29850 1441.06
Poly Approx 14.2 0.07 16733.33 29850 2573.34
Drop Frac 15.3 0.08 16602.67 29850 2451.88
(c) Experiment over 920 schedules, with u, v, w ∈ Z0 ∩ [−5, 5],
gcd(u, v, w) = 1, N = 10.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 65.0 0.07 136.0 212.00 0.00
Add Var 658.5 0.72 138.0 285.00 17.28
Mod Classes 234.0 0.25 136.0 285.00 7.18
Split Periods Th=10 154.0 0.17 136.0 285.00 4.73
Split Periods Th=100 154.0 0.17 136.0 285.00 4.73
Poly Approx 77.8 0.08 152.0 285.81 33.12
Drop Frac 82.0 0.09 149.5 622.85 82.68
For larger N the relative difference in accuracy between the methods
becomes smaller.
In Table 4.9(c) schedules with larger coefficients of the iterators in
the schedule function are present. As a result, the average denominator
4.7 Memory size estimation experiments 113
of the fractional expressions, i.e. the period of the periodic numbers,
increases. This is in favor of the Split Periods and to the detriment of the
Mod Classes method.
The tables only give information about the average absolute accu-
racy and computation time of the methods. When searching for the
best schedules it is more important that the relative order of the needed
memory size as a function of the schedule is preserved. Therefore, in
Figure 4.11, the results of the different methods were plotted as a func-
tion of the schedules sorted along the exact memory size.
The first 24 schedules are recognized by all methods as having the
lowest memory demand. For the other schedules the relative ordering
is roughly preserved if N = 100. For N = 10, especially Drop Frac has
large deviations.
With initialization statement S1
Now, two schedule functions are needed
θS1(i, j, k) = ai+ bj + c
θS2(i, j, k) = ui+ vj + wk .
There is one schedule restriction: each element of C should be ini-
tialized by the corresponding invocation of S1 before its first read in an
invocation of S2:
θS1(i, j) < θS2(i, j, k)
ai+ bj + c < ui+ vj + wk
c < (u− a)i+ (v − b)j + wk
c ≤ min
i,j,k
((u− a)i+ (v − b)j + wk)− 1 .
For given a, b, u, v and w, the maximal value of c can be expressed as
a function ofM , N and L. Choosing c smaller than this maximal value
can only increase the memory size requirements by augmenting the
time between initialization and usage of the array elements. Therefore,
the used schedules will be defined by a, b, u, v and w from which the
114 Bounds on quasi-polynomials for static program analysis
Table 4.10: Experiments each estimating the memory usage of a Matrix Mul-
tiplication with initialization statement S1. The 1-D schedules are defined by
the vectors (a, b, c) and (u, v, w).
(a) Experiment over 32 schedules, with a, b, u, v, w ∈ {−1, 1},N = 100.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 3.5 0.11 22400 31257 0.00
Add Var 1.6 0.05 29530 39451 5788.26
Mod Classes 1.6 0.05 29530 39451 5788.26
Split Periods Th=10 1.6 0.05 29530 39451 5788.26
Split Periods Th=100 1.6 0.05 29530 39451 5788.26
Poly Approx 2.2 0.07 29724 39649 5944.98
Drop Frac 2.1 0.07 29530 39850 5865.26
(b) Experiment over 672 schedules, with a, b, u, v, w ∈ Z0 ∩ [−2, 2],
gcd(a, b) = 1, gcd(u, v, w) = 1, N = 100.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 103.8 0.15 19899 31257 0.00
Add Var 49.3 0.07 19899 39525 2861.27
Mod Classes 57.6 0.09 19899 39525 2826.65
Split Periods Th=10 48.7 0.07 19899 39525 2861.27
Split Periods Th=100 294.8 0.44 19899 39525 1931.47
Poly Approx 57.8 0.09 19999 39825 2949.43
Drop Frac 61.0 0.09 19899 39850 2832.65
(c) Experiment over 5600 schedules, with a, b, u, v, w ∈ Z0 ∩ [−3, 3],
gcd(a, b) = 1, gcd(u, v, w) = 1, N = 100.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 1083 0.19 16600 31481 0.00
Add Var 684 0.12 16601 39525 2076.05
Mod Classes 933 0.17 16600 39525 2026.44
Split Periods Th=10 680 0.12 16601 39525 2075.97
Split Periods Th=100 4767 0.85 16600 39525 980.15
Poly Approx 566 0.10 16733 39833 2184.61
Drop Frac 593 0.11 16602 39850 2063.38
value of c is computed as
c = min
i∈[1,M ]
j∈[1,N ]
k∈[1,L]
((u− a)i+ (v − b)j + wk)− 1
= (u− a)(u− a > 0 ? 1 : M) + (v − b)(v − b > 0 ? 1 : N)
+w(w > 0 ? 1 : L)− 1 .
4.8 Hybrid methods 115
Table 4.10 lists the results of several sets of schedules. The conclu-
sions that can be drawn are similar to the case without initialization.
The average computation time per schedule increases as the average
denominator increases (from (a) over (b) to (c)).
4.8 Hybrid methods
The results of the experiments in Section 4.7 showed that the Exact
method is sometimes the slowest and sometimes the fastest depending
on the sizes of the domains. The domain sizes can vary a lot between
experiments but also between chambers within an experiment as indi-
cated in Figure 4.9.
Therefore, we constructed and tested hybrid methods. In chambers
with a size smaller than or equal to a threshold parameter Thhybrid the
Exact method is used. In larger chambers one of the methods used in
Section 4.7 is used. The experiments leading to the results in Table 4.6,
4.7 and 4.9 were repeated with hybrid methods to produce the results
listed in Table 4.11, 4.12 and 4.13, respectively. If the method used in
the larger chambers is also the Exact method the execution time indi-
cates the overhead caused by the selection method. For example using
Table 4.9(a) and 4.13(a), we get
Toverhead = THybrid-Exact − TExact = 11.6 s− 10.8 s = 0.8 s .
Here, the overhead is less than 10% of the total execution time. In the
other examples it is even less.
We will call the methods used in Section 4.7 basic methods. Each
basic method has a corresponding hybrid method. When comparing
the tables in Section 4.7 with the tables in this section the following can
be observed:
• The accuracy of a hybridmethod is always higher than, or as high
as, the accuracy of the corresponding basic method.
• In experiments where the Exactmethod is one of the fastest basic
methods (Tables 4.6(a,b), 4.9(a and c)) all hybrid methods have an
execution time close to that of the Exact method, except for Split
Periods with Th = 100. The accuracy is also good, except for the
hybrid methods using Poly Approx or Drop Frac.
116 Bounds on quasi-polynomials for static program analysis
• In experiments where the Exact method is one of the slowest ba-
sic methods Tables 4.7(a,b), 4.9(b)) only slight improvements in
accuracy and slight differences in execution time are visible when
going from a basic method to the corresponding hybrid method.
This is caused by the fact that only a little fraction of the cham-
bers have a size smaller than the hybrid threshold. In fact, these
are the experiments where Add Var, Mod Classes and Split Periods
are fast and accurate.
• A hybrid method can be faster than all basic methods, excluding
Poly Approx and Drop Frac (Table 4.11 and 4.12).
These observations prove that the selection mechanism of the hybrid
methods works well.
The hybrid method could be extended with a mechanism that se-
lects which method to use in the large domains. Choosing between
Add Var, Mod Classes and Split Periods is more complex than between
Exact and one of these three methods. Such a method should study
algebraic properties of the quasi-polynomials, such as degree and pe-
riods of periodic numbers and should have a selection heuristic based
on these properties. The differences in accuracy and computation time
between these three methods are much smaller than between one of
these methods and the Exact method. As a result, the gain that can be
obtained is smaller than the gain obtained by using the Exact method
for small domains as presented in this section. Therefore, no effort was
spent on extending the selection mechanism of the hybrid methods.
4.9 Related work
4.9.1 Memory size estimation
Computing the exact memory usage of a program, or bounding or ap-
proximating it has been the subject of numerous papers. A selection is
given in [56]. Here only two methods are mentioned because of their
relation to the methods presented in this chapter.
Zhu et al. [169, 168, 22] developed a method that uses polyhedral
techniques. For each array the references are decomposed into disjoint
linearly bounded lattices. The program code is regarded as a sequence
of loop nests, called blocks. The memory size between the blocks is
4.9 Related work 117
Table 4.11: Experiments each estimating the memory usage of a 1-D FIR-filter,
for 20 different 1-D schedules, defined by the vector (u, v). L = 50, N = 100.
A hybrid method is used. For chambers with a size ≤ 50 = Thhybrid the Exact
method is used. For larger chambers the method listed in the first column is
used. The experiments are similar to those used for constructing Table 4.6.
(a) Experiment over 20 schedule vectors (u, v), with u ∈ Z0 ∩ [−10, 10], v = 1.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 1.3 0.07 99 197 0
Add Var 1.1 0.05 99 197 2.62
Mod Classes 1.1 0.05 99 197 0.22
Split Periods Th=10 1.1 0.05 99 197 2.62
Split Periods Th=100 3.0 0.15 99 197 0.22
Poly Approx 1.0 0.05 99 198.8 74.42
Drop Frac 1.0 0.05 99 2477 1079.32
(b) Experiment over 20 schedule vectors (u, v), with u = 1, v ∈ Z0 ∩ [−10, 10].
Method T(s) T/Sched.(s) Min Max RMSE
Exact 1.3 0.07 100 197 0
Add Var 1.2 0.06 100 197 3.01
Mod Classes 1.1 0.06 100 197 0
Split Periods Th=10 1.2 0.06 100 197 3.01
Split Periods Th=100 3.7 0.18 100 197 0
Poly Approx 1.1 0.05 100 202.8 76.52
Drop Frac 1.0 0.05 100 1410 714.75
computed from the linearly bounded lattices using Barvinok’s algo-
rithm. To determine the memory usage inside a block, the minimal
and maximal iteration vectors that access a data element are searched.
This is done for all elements of the corresponding linearly bounded lat-
tice. This last step makes the method unsuitable for parametric prob-
lems and makes the computation time increase with the domain size,
although less than other methods.
This disadvantage is taken as an argument for the methods pre-
sented in [56]. Bernstein expansion does not suffer from these disad-
vantages and allows to find parametric bounds. However, in the pa-
per only polynomials are considered. With a reference to the methods
of Meister [127], dealing with quasi-polynomials is avoided. This has
been the motive for the work presented in this chapter.
118 Bounds on quasi-polynomials for static program analysis
Table 4.12: Experiments each estimating the memory usage of a 1-D FIR-filter,
for 20 different 1-D schedules, defined by the vector (u, v). L = 50, N = 1000.
A hybrid method is used. For chambers with a size ≤ 50 = Thhybrid the Exact
method is used. For larger chambers the method listed in the first column is
used. The experiments are similar to those used for constructing Table 4.7.
(a) Experiment over 20 schedule vectors (u, v), with u ∈ Z0 ∩ [−10, 10], v = 1.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 11.2 0.56 99 197 0
Add Var 1.1 0.05 99 197 2.65
Mod Classes 1.1 0.05 99 197 0.32
Split Periods Th=10 1.1 0.05 99 197 2.65
Split Periods Th=100 1.1 0.05 99 197 2.65
Poly Approx 1.0 0.05 99 198.8 74.48
Drop Frac 1.0 0.05 99 24077 10113.19
(b) Experiment over 20 schedule vectors (u, v), with u = 1, v ∈ Z0 ∩ [−10, 10].
Method T(s) T/Sched.(s) Min Max RMSE
Exact 3.0 0.15 100 1079 0
Add Var 0.9 0.05 100 1079 0
Mod Classes 1.2 0.06 100 1079 0
Split Periods Th=10 0.9 0.05 100 1079 0
Split Periods Th=100 4.6 0.23 100 1079 0
Poly Approx 1.0 0.05 100 1961 525.51
Drop Frac 0.9 0.04 100 2877 1042.53
4.9.2 Profiling
In this chapter static analysis methods are used to measure or estimate
certain properties of an implementation. Another option is profiling,
i.e. running an implementation and measuring the properties one is
interested in. To be useful a profiling method has to do more than
only measuring. It has to link the measurements with the data sets
and pieces of code that are responsible for the results [123].
Typically, profiling slows down the execution and increases the
memory usage with a nonnegligible factor. In [38] a statistical model
is used to reduce this overhead. Only a subset of the memory accesses
is sampled to estimate the reference distances7. Sampling windows re-
7The reference distance, i.e. the number of accesses between use and reuse, is eas-
ier to measure than the reuse distance, the number of different elements accessed (see
Appendix B).
4.9 Related work 119
Table 4.13: Experiments each estimating the memory usage of a Matrix Mul-
tiplication without initialization statement S1, using a hybrid method. For
chambers with a size ≤ 50 = Thhybrid the Exact method is used. For larger
chambers the method listed in the first column is used. The 1-D schedules are
defined by the schedule vector (u, v, w). The experiments are similar to those
used for constructing Table 4.9.
(a) Experiment over 200 schedules, with u, v, w ∈ Z0 ∩ [−3, 3],
gcd(u, v, w) = 1, N = 10.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 11.6 0.06 160 210 0
Add Var 11.5 0.06 160 210 0
Mod Classes 11.5 0.06 160 210 0
Split Periods Th=10 11.5 0.06 160 210 0
Split Periods Th=100 11.5 0.06 160 210 0
Poly Approx 14.5 0.07 173 226 10.81
Drop Frac 15.3 0.08 162 285 9.23
(b) Experiment over 200 schedules, with u, v, w ∈ Z0 ∩ [−3, 3],
gcd(u, v, w) = 1, N = 100.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 29.0 0.15 16600 22350 0.00
Add Var 15.7 0.08 16601 29850 2459.26
Mod Classes 19.7 0.10 16600 29850 2420.90
Split Periods Th=10 15.7 0.08 16601 29850 2459.26
Split Periods Th=100 117.9 0.59 16600 29850 1441.06
Poly Approx 14.5 0.07 16733.3 29850 2573.34
Drop Frac 15.5 0.08 16602.7 29850 2451.87
(c) Experiment over 920 schedules, with u, v, w ∈ Z0 ∩ [−5, 5],
gcd(u, v, w) = 1, N = 10.
Method T(s) T/Sched.(s) Min Max RMSE
Exact 68.6 0.07 136 212 0
Add Var 68.9 0.07 136 212 0
Mod Classes 70.5 0.08 136 212 0
Split Periods Th=10 69.0 0.07 136 212 0
Split Periods Th=100 68.7 0.07 136 212 0
Poly Approx 79.2 0.09 152 232 18.88
Drop Frac 88.2 0.1 149 622 80.23
duce the error when the miss ratio is not constant during the execution.
The ATOMIUM tool suite [47] is used in the IMEC DTSE method-
120 Bounds on quasi-polynomials for static program analysis
ology (Data Transfer and Storage Exploration). It profiles the memory
accesses and estimates the energy dissipation of an execution, for each
element in a set of possible memory hierarchies.
The SLO tool (Section 3.2.2) not only records reuse distance infor-
mation but also suggests, in a rough way, transformations that may
improve the locality.
A more detailed comparison of existing tools can be found in [43,
41].
Profiling is only possible if the architecture on which an implemen-
tation will run is available. This is not a problem when optimizing soft-
ware for a given processor. However, when targeting a future platform,
e.g., a processor that is still in a design phase or an implementation on
FPGA, such a platform does not exist yet. In some of these cases sim-
ulation is possible, which is orders of magnitude slower. Here, static
analysis is likely to be more profitable.
4.10 Conclusion
The differences between extrema of polynomials over continuous and
over discrete domains have been studied in this chapter. A bound on
the absolute difference for a given polynomial had already been de-
rived by Rivlin. In this section we presented a bound on the relative
difference for an arbitrary polynomial of a given degree over a given
interval.
The differences between the extrema over continuous and discrete
domains are smaller for larger domains or lower degrees. Considering
this property, novel methods were introduced for finding bounds on
(piecewise) quasi-polynomials in discrete domains, based on bounds
of polynomials in continuous domains. The estimation times are not
proportional to the domain size, in contrast with the exact method that
evaluates a quasi-polynomial in each discrete point of the domain.
The methods were implemented and tested by estimating the mem-
ory usage of several schedules of two small programs. For the 1-D FIR,
the presented methods perform quite well. However, for the matrix
multiplication there seems to be less advantage compared to existing
techniques, a.o., the evaluation in each point of the discrete domain.
Note that for multidimensional schedules or larger domains the latter
may become much slower.
4.10 Conclusion 121
By introducing a simple selection heuristic, selecting the Exact
method for small domains and another method for large domains,
we have demonstrated that the good characteristics of different meth-
ods can be combined. A more complex selection heuristic based on
other quasi-polynomial properties, in addition to the domain size, may
possibly improve on these hybrid methods, but probably with less
added value than this simple heuristic.
An advantage of Bernstein expansion is that symbolic maximiza-
tion is possible as suggested in [56]. This possibility has not yet been
thoroughly examined for the quasi-polynomial maximization. A disad-
vantage is that a large approximation error may be introduced by the
Bernstein expansion, much larger than the error introduced by the dif-
ference between the discrete-domain and continuous-domain extrema
of polynomials. The accuracy may be improved by subdivision of the
domains. Doing this in an efficient way for arbitrary polyhedral do-
mains is not trivial and may be a topic for future research.
Bounds on quasi-polynomials can result in, e.g., memory size esti-
mates of an implementation. Such estimates are needed to guide loop
transformations (Chapter 3). In the experiments in this chapter sets
of schedules were scanned exhaustively. This is similar to the way
Pouchet et al. [137] deal with small examples. The decoupling heuris-
tic they present decreases the number of schedules to evaluate. Since
this heuristic does not depend on the evaluation method (compilation
and running the program in their case, i.e. iterative compilation), the
heuristic can probably also be applied when evaluating a schedule with
bounds on quasi-polynomials.
122 Bounds on quasi-polynomials for static program analysis
Chapter 5
Hardware generation from
the polyhedral model
The previous chapters dealt with the search for and the application of loop
transformations to improve the locality of an algorithm. This was done in a
platform independent way. In this chapter the path towards a hardware imple-
mentation is considered, starting from the polyhedral model.
A hardware architecture with correspondence to the polyhedral model is pre-
sented. By automating the generation of control hardware, the influence of
loop transformations on the hardware can be investigated by exploring (part
of) the design space. One or more variants can then be selected for further re-
finement, optimizations and system integration.
The implementation of a 2D-IDWT was taken as a case study. Several vari-
ants without memory hierarchy were generated and compared with each other
and with designs made with the high-level synthesis tools Impulse C and the
Celoxica Handel-C compiler. One variant was further refined and integrated
with other components on a FPGA to measure performance and power dissi-
pation.
5.1 Introduction
As mentioned in Section 3.1.1 the manually designed RC-IDWT cannot
reach its maximal execution speed due to its high bandwidth require-
ments. A new version with lower bandwidth requirements is needed.
Loop transformations can improve the locality and thus reduce the data
traffic. After these transformations, the whole design trajectory from
124 Hardware generation from the polyhedral model
Datapath + Control
Statements
Memory
Loop Control
Figure 5.1: Top level representation of the hardware architecture.
high-level software description until hardware implementation has to
be redone. As will be demonstrated in this chapter loop transforma-
tions not only influence the data access pattern but also have an impact
on the control complexity and thus hardware implementation. Iterat-
ing over the design cycle, including both loop transformations and re-
finement to hardware, may be needed until the requirements are met.
Therefore, this work should be automated and integrated as much as
possible.
The class of applications that benefit from an implementation on
FPGAs or dedicated hardware typically share the following charac-
teristics: a low control complexity with little data dependence, much
low-level parallelism, computation and data intensive operations. The
restrictions required by the polyhedral model, static control, affine ex-
pressions, . . . , correspond well with the first characteristic. The other
two characteristics are not limited by the model. Therefore, most appli-
cations that may benefit from a hardware implementation, may fit into
the polyhedral model.
A shorter path from loop transformation to hardware allows de-
sign space exploration and creation of a library of variants of a design.
The latter is useful in systems that can be reconfigured dynamically.
Several variants of a design are then stored and the one that best fits
the working conditions at a certain time is loaded into the FPGA. Also
when an application has to run on different platforms with, e.g., differ-
ent computational power or memory bandwidth, generating variants
of a design may be beneficial.
In this chapter a hardware architecture is presented which corre-
sponds to the polyhedral representation. It is composed of two entities,
one with the statement implementations (both datapath and control)
and one with a controller that drives the iterators and triggers the state-
ments (Figure 5.1). Since loop transformations change the loop control
5.2 Hardware architecture 125
HWSW
statements
control
defines.h
statements.vhd
loop.vhd
Man./Autom.
CLooGVHDLLoop Xform
loop.cloog
Figure 5.2: CLooGVHDL extends the CLooG software generation process
(Figure 3.16) to create hardware.
and structure of an algorithm but leave the statement definitions un-
changed (cf. Figure 3.16), the focus is on the automatic generation of
the loop (nest) control hardware. The statement implementations can
be reused among different loop transformation results. On the other
hand several variants of the statements, e.g., trading-off area and num-
ber of clock cycles, can be connected to the same loop control entity. We
have written a back-end to the CLooG code generator, called CLooG-
VHDL (Figure 5.2), which fully automates the generation of a controller
implementation, consisting of communicating automata. The genera-
tion of the data path and statement control (control which does not de-
pend on the loop nest) is only partially automated, but as mentioned
above this is less problematic.1
5.2 Hardware architecture
Since in the polyhedral model the control and statements are separated
and recombined in the end, an architecture is proposed that has the
same composition. At a high level this looks like Figure 5.1. The mem-
ory is shown as one entity but may be split into several banks or ex-
tended to a full memory hierarchy as in Figure 1.6. This will be demon-
strated in Section 5.6.
In this section the architecture will be presented at a high level. A
more detailed view is given in Section 5.3.
1Techniques that could automate this part do already exist and are used in high-
level synthesis tools. Therefore, this part would imply a lot of implementation work or
integration with existing tools (not necessarily more feasible) before further research is
possible.
126 Hardware generation from the polyhedral model
IDx
S7S6S5S4S3S2S1
for i
ID1
ID2
for 1
for 2
m..nS
S1...7
Identifier
block signals
Control
Statement
implementation
Loop
counter
ID0
for j
for i for k
Program
0 1 2
0 1 20
10
1
Loop Control Entity
Figure 5.3: Example of a simplified abstract syntax tree (left) and hardware
architecture (right) of the corresponding loop controller. A single hardware
block implements all statements S1 . . .7. The block for 1 implements the loops
at depth 0: for i and for k. Which of the two is executed depends on the value
received from ID0
5.2.1 Loop control architecture
Sequential execution
First, a sequential execution of the statement instances, equivalent to
a software execution, is considered. In the abstract syntax tree in Fig-
ure 5.3, the numbers and iterators on the path between the top-node
Program and a statement node correspond to the elements of the sched-
ule vector of that statement. The controller is composed of a set of com-
municating automata, a so-called factorized implementation [19]. Each
automaton corresponds to one dimension of the schedule vector. This
results in two types of automata. A first type, the loop counter blocks,
is responsible for the iterators. for 1, e.g., drives i and k. The other type,
the identifier blocks, e.g., ID 1, corresponds to the elements of the or-
dering vectors. The loop counter blocks calculate the loop bounds and
stride in function of the parameters, the iterators of surrounding loops
and the more significant elements of the ordering vector. The identifier
5.2 Hardware architecture 127
Table 5.1: Comparison of implementation variants of the example in Fig-
ure 5.5 with dummy statements and fixed loop bounds (cols=8, rows=5). Word
width denotes the number of bits used to represent the iterator values and
some internal signals.
(a) Monolithic implementation
Word width LE fmax(MHz)
32 bit 269 116.90
4 bit 55 219.93
(b) Factorized implementation
Word width LE fmax(MHz)
32 bit 121 167.17
4 bit 47 232.29
blocks count from zero onwards to enumerate the different statements
and loops at the level below. ID 1, e.g., counts from 0 to 1 if the value
received from ID 0 is 0, and from 0 to 2 if that value is 1. When a state-
ment or loop at the level below has finished its execution, the counter
is incremented and the next statement or loop is triggered.
This architecture has, a.o., the following interesting properties.
• Experiments showed that the proposed factorized implementa-
tion consumes less area and reaches a higher clock frequency than
a monolithic control block. Table 5.1 gives an example.
• The higher in the hierarchy, the lower the switching activity of
the blocks. This allows clock gating [35] or other techniques for
power reduction.
• For deep and large loop nests some control blocks might become
very complex as they implement different behavior for a lot of dif-
ferent identifier values (components of ordering vectors). In that
case it may be beneficial to regard subtrees as programs in their
own right and give them their own control hardware. This results
in more, but smaller and faster automata and a trade-off between
clock speed and area. For example, in Figure 5.4(a) the for i and
for k loop get their own controllers. In principle the two loops
can now also operate in parallel. How the clock speed can be in-
creased by duplicating hardware will be shown in Section 5.5.
• Pipelining is possible by inserting registers between the au-
tomata, which may increase the maximal clock frequency (Sec-
tion 5.5).
128 Hardware generation from the polyhedral model
ID2’
ID1’
ID0
S7S6S5S4S3S1 S2
for 1’ for 1
for 2’
ID1’’
(a)
ID1
ID2’’
for 1
ID0
S4 S5 S6 S7S3S2S3S2S1
for 2’’
(b)
Figure 5.4: Duplicating hardware for reduction of complexity or introduction
of parallelism
For the case of sequential execution, CLooGVHDL can generate a
loop controller from a CLooG [27] input file.
Parallel execution
By creating separate hardware for different loops and the statements
they control, it is possible to execute them in parallel. Dependence
analysis using the polyhedral model allows to check which statement
instances can run in parallel. Thanks to the factorized implementation,
parallelization, if legal, is made straightforward by duplication of sub-
trees or statements. For example, in Figure 5.4(b) duplication of S2 and
S3makes it possible to run several iterations of the j loop in parallel or
in a pipeline.
The detailed elaboration of code generation for parallel execution is
left as future work. However, to prove that the proposed architecture
indeed allows parallel execution an example has been adjusted manu-
ally from a sequential to a parallel execution as will be shown in Sec-
tion 5.3.1.
5.3 Implementation details 129
5.2.2 Statement control and data path
Since in the case of sequential execution no two statements operate
at the same time, it is possible to share hardware among statements.
Therefore, in Figure 5.3 all statements are implemented as a single
VHDL process. The generation of the statements entity is not fully
automated yet. This is not a large problem since a loop transformation
only influences the controller entity and the statements entity can be
reused for several loop transformation variants.
Splitting the statements entity, as in Figure 5.4, may be beneficial
for similar reasons as for the control entity, i.e. parallelism and clock
frequency.
5.3 Implementation details
1 for (p3=0; p3<=1; p3++){
2 for (p5=1; p5<=cols−2; p5++){
3 S1 (0,0,p3,0,p5 ,0);
4 } /∗end for p5∗/
5 } /∗end for p3∗/
6 for (p3=2; p3<=rows−1; p3++){
7 for (p5=1; p5<=cols−2; p5++){
8 S1 (0,0,p3,0,p5 ,0);
9 S2 (0,1,p5,0,p3−1,0);
10 } /∗end for p5∗/
11 } /∗end for p3∗/
for p5 for p5
for p3for p3
Program
S1 S2
1
0 10
0 0
0
t_2
t_1
t_0
(a) (b)
Figure 5.5: Example C output of CLooG (a) and corresponding AST (b). Since
S1 occurs at two places the structure in (b) is in fact not a tree anymore. Nev-
ertheless, we continue to use the name AST.
This section refines the architecture presented in the previous sec-
tion. To construct instances of the architecture presented here, sev-
eral tools have been built or used to partially automate the work. The
VHDL code generated by the tools contains a lot of redundancy (an
example is given in Appendix F). A lot of signals or ports may be un-
needed because their value is never read. Conditional assignmentsmay
have tens of cases from which only a few or even none differ. Modern
130 Hardware generation from the polyhedral model
start_s2
s1_lc
start_s1
s2_lc
S1 S2
ID2
seq_2_lc
start_seq_2
for 2
ID1
for_2_lc
for 1
for_1_lc
seq_1_lc
ID0
start_seq_0=start
seq_0_lc=lc
t_1
t_0
start_for_2
start_for_1
start_seq_1
Figure 5.6: Detailed view of the factorized hardware architecture, correspond-
ing to the code in Figure 5.5. Control signals between the automata are indi-
cated. The loop iterators p3 and p5, and the statement arguments are omitted.
The dotted lines indicate two possible positions to insert registers for pipelin-
ing.
synthesis tools have no problems with this redundancy and will opti-
mize away all unnecessary signals, simplify combinational logic and
propagate constants. Therefore, no attempt was made to generate more
optimal code, since we believe that FPGA vendor synthesis tools are
better in optimizing logic for the target FPGA. The code instead tries to
give full optimization freedom to the synthesis tools.
5.3 Implementation details 131
5.3.1 Loop controller generation
Sequential execution
Amore detailed view of the control architecture that corresponds to the
code in Figure 5.5 is shown in Figure 5.6. The full VHDL code is listed
in Appendix F).
The execution time of a statement invocation does not have to be
known at compile time and may vary during execution or between
variants of the statement implementations. A simple handshake be-
tween controller and statements makes this possible without losing
clock cycles. A start signal triggers the execution of a statement. The
last cycle signal becomes active one cycle before the statement finishes
its execution. In the case of a one cycle execution time the last cycle out-
put of a statement is thus simply connected to the incoming start signal.
The same protocol is used between the different control automata of the
loop controller.
This is demonstrated in Figure 5.7 for the example shown in Fig-
ure 5.5 and 5.6. The pulse on the start signal (= start seq 0) (cycle 5) is
directly passed to start for 1, start seq 1, start for 2 and start seq 2 (indi-
cated with arrows). Since t 0 = t 1 = t 2 = 0 (see Figure 5.5(b)) the
identifier block ID2 triggers statement S1 by raising start s1. Since in
this example the statement S1 has an execution time of one clock cycle,
the first cycle is also the last cycle, whichmeans that the value of start s1
is passed to s1 lc. The inner loop for t 0 = t 1 = 0 only contains the sin-
gle statement S1. Therefore the last cycle of S1 is also the last cycle of
the inner of the loop with iterator p5 and seq 2 lc becomes high. As a
result, p5 is incremented at the next clock cycle and the inner of this
loop is started by setting (i.e. holding) start seq 2 to ’1’, which is again
passed to start s1. . . seq 2 lc. Now p5 has reached its end value (for 2 ev
= 2, not shown in the diagram), and for 2 lc and seq 1 lc (since there is
only one loop at this level) are also raised. At the next clock cycle p3 is
incremented and the inner of this loop is executed as before. Two cycles
later, the first loop nest (line 1–5 in Figure 5.5(a)) has been executed and
t 0 is incremented to start the execution of the second loop nest (line 6–
11). Pulses on start signals are passed until start s1. Here S1 is followed
by S2. start s2 is thus raised one cycle after s1 lc becomes high. At the
same time t 2 is incremented. The execution time of S2 is two cycles.
Therefore, s2 lc becomes high one cycle later. When S2 has finished its
execution t 2 is reset, p5 is incremented and S1 is started again. This
132 Hardware generation from the polyhedral model
0 20 3010 cycles
t(ns)1000 200
Figure 5.7: Timing diagram of the signals indicated in Figure 5.6 without extra
registers between the automata. fclk = 146 MHz.
continues until the last statement invocation. The last pulse on s2 lc is
passed to lc as indicated with arrows. One can notice that at each clock
cycle one statement is executing. No cycles are lost in the modules that
pass the control signals.
A disadvantage of this protocol is that long combinational paths
may occur in deeply nested loops. Therefore, an option is provided to
5.3 Implementation details 133
t(ns)1000 200
0 20 3010 cycles40
Figure 5.8: Timing diagram of the signals indicated in Figure 5.6 with extra
registers between each identifier block and the loop counter below (indicated
with dotted lines in Figure 5.6). fclk = 192 MHz.
include registers between the different levels of the control automata,
leading to a pipelined implementation. These registers introduce extra
clock cycles but increase the achievable clock speed by dividing combi-
national paths into smaller parts. At high levels the switching activity
is low and the introduced clock cycle cost is fairly small. Experiments
in Section 5.5 and 5.6.1 (Table 5.10) examine the influence on area, clock
134 Hardware generation from the polyhedral model
frequency and execution time. Note that registers on signals that go up
in the hierarchy can be inserted independently from buffers on signals
that go down in the hierarchy.
A timing diagram of the example studied above with registers (also
called buffers) inserted between each identifier block and the loop
counter block below (dotted lines in Figure 5.6) is shown in Figure 5.8.
The start signals now need two cycles to descend from the top to the
bottom of the hierarchy (see arrows). For example, start seq 1 rises one
cycle after start seq 0 and one cycle before start seq 2 and start s1. In the
given example the effect of shortening the clock cycle is outweighed by
the increase in the number of cycles. This is due to the small number of
iterations in the inner loop (only two). For a larger number of iterations
the total execution time will be shorter than without extra registers.
Also the signals t 0 and t 1 can be buffered. As a result shifted ver-
sions of these signals are used in the lower levels of the controller hier-
archy (not shown in the figure). However, using an unbuffered version
will not violate the correct execution, since the value is only used after
a start pulse arrives and the value cannot change while blocks lower in
the hierarchy are active. The opposite is not allowed, e.g., buffering t 0
without buffering the start signals.
The generated VHDL expressions for the loop bounds are equiva-
lent to the expressions generated by CLooG. As a result, operators as
mod and div are only synthesizable (in an efficient way) for powers
of 2. Techniques as those presented in [171] could extend this synthe-
sizable subset. The Method of Differences technique (MoD) exploits the
repetitive evaluation of expressions in loops and uses differences of the
terms of an expression to calculate the next value. An integer division
by a constant can be converted into a multiplication with a constant
and a shift, i.e. using so-called scaled reciprocals. Also [149] presents
techniques to eliminate division and modulo operations, by inserting
conditionals and using algebraic axioms and loop transformations.
The interface between the controller and the statements consists of
start and last cycle signals together with the iterator values or possibly
the values of the original iterators, i.e. those before the application of
loop transformations. This corresponds with the statement arguments
generated by CLooG with the option “-cpp 1”. The interface to the
outer world (top of ID0) consists of a start, ready and last cycle port and
ports for all parameters. Input and output of data is done by statements
and not directly visible to the loop controller.
5.3 Implementation details 135
Table 5.2: Synthesis results of control hardware of the IDWT variants listed
in Table 3.6. RC HVs and LB VHs are variants with a smaller tile size, edge
1 instead of 4 in the smallest subband. The influence of the control optimiza-
tions is clearly visible (command line options “-f 1” (a) and “-f -1” (b)).
Dummy statements are connected to the generated loop controller. No opti-
mizations on the loop bound expressions have been applied (cf. Section 5.4).
Results obtained with Altera QuartusII 6.1. for the Stratix EP1S25F1020C5. In-
teger data types of 10 bit used for iterators and parameters. (LB VH (a) uses
244,736 bits of memory to reduce the number of LEs used).
#lines C LE fmax(MHz)
(a) (b) (a) (b) (a) (b)
RC 32 32 223 223 185.49 185.49
RC H 166 55 1041 787 80.90 85.46
RC HV 456 69 2449 2533 55.08 11.23
RC HVs 246 59 1768 1820 64.21 16.53
RC V 98 46 668 632 87.90 90.64
RC VH 400 57 2318 2515 54.78 11.58
VH LB 1887 75 6563 3218 14.29 9.69
LB 83 56 468 521 119.65 54.35
LB H 926 71 3410 866 40.48 72.53
LB HV 2765 73 8737 2003 29.71 13.62
LB V 425 64 1188 734 73.34 42.86
LB VH 5639 75 21820 3195 4.52 10.73
LB VHs 716 69 3869 2710 13.45 11.62
The code optimization options of CLooG have a large impact on
the generated VHDL. The synthesis results of the variants listed in Ta-
ble 3.6 are shown in Table 5.2. All elements of the schedule vector are
passed as arguments to the statements. As a result, only synthesiz-
ing the loop control entity would lead to unrealistic numbers of pins
(from 141 to 477) and timing results that are not representative for a
design with statement implementations on-chip. Therefore, dummy
statements are connected to the controller and only 15 pins remain. The
dummy statements write their arguments to registers (one set of regis-
ters for all statements). The logic AND of the output of these registers is
connected to output pins to avoid them to be optimized away. Variants
with a large number of lines of C code have a low clock frequency and
a high LE usage. However, the block-based variants generated with
the option “-f -1” (b) also appear to have a low clock frequency and
136 Hardware generation from the polyhedral model
relatively large numbers of LEs. Section 5.4 investigates the origins of
these bad synthesis results and presents methods to improve on these.
Parallel execution
Figure 5.9 shows a timing diagram corresponding to Figure 5.8 but with
parallel execution of S1 and S2. This is achieved by using the par sync
entity shown in Figure 5.10 and 5.11. This block has on its top side an
interface that looks like that of any other block or statement consid-
ered before: a start and a last cycle (lc) signal. At the bottom side it
drives two blocks B1 and B2 that could be statements, identifier blocks
or loop counter blocks. The lc signal becomes high one cycle before both
blocks have finished execution. Extension to more than two blocks is
straight-forward. Extra code is needed to drive the input arguments of
the blocks.
5.3.2 Statement control and data path generation
To automate the generation of the statement implementations, a tool
would be needed that parses the statement definitions to construct an
AST representation, does cycle and hardware assignment of the opera-
tions and generates synthesizable VHDL. To avoid the effort of build-
ing or customizing such a tool, an alternative was found in building
and using a code generator, called Steps2Process (see below), and writ-
ing ad-hoc Vim scripts. Still, user interaction is needed but much less
than in a fully manual design flow. In addition, a lot of work can be
reused when iterating over a design.
The control hardware inside the statements entity has to take care
of the timing of the operations on the data path and the accesses to
the memories. Steps2Process was written to generate a statements en-
tity starting from a file that contains the schedule information for each
statement. Here schedule means the operations that have to be per-
formed in consecutive cycles of a statement invocation. This differs
from but is similar to the usage of schedule until now, which meant the
ordering of statement invocations. Example input files are shown in
Figure 5.12(b) and (c). Each line (not starting with “--”) contains op-
erations that should be executed in one clock cycle. Lines starting with
“-- #”, are used to separate the different statement descriptions and
to direct conditional execution.
5.3 Implementation details 137
t(ns)1000 200
0 20 3010 cycles40
Figure 5.9: Timing diagram of the signals indicated in Figure 5.6 with extra
registers and parallel execution of S1 and S2. fclk = 186 MHz.
par_sync
B1 B2
lc
start_B2
B1_lc B2_lc
start
start_B1
Figure 5.10: Entity to drive two statements or parts of loop nests in parallel.
138 Hardware generation from the polyhedral model
1 LIBRARY ieee, work;
2 USE ieee.std logic 1164.all ;
3
4 entity par sync is
5 port(
6 clk , reset : in std logic ;
7 start : in std logic ;
8 lc : out std logic ;
9 B1 lc , B2 lc : in std logic ;
10 start B1 , start B2 : out std logic
11 ) ;
12 end par sync;
13
14 architecture rtl of par sync is
15 signal B1 ready, B2 ready : std logic ;
16 begin
17 start B1 <= start;
18 start B2 <= start;
19 lc <= (start and B1 lc and B2 lc) or
20 (not( start ) and (B1 lc or B2 lc) and
21 (B1 ready or B1 lc) and (B2 ready or B2 lc)) ;
22
23 ready: process(clk)
24 begin
25 if (clk=’1’ and clk’event) then
26 if ( reset = ’1’) then
27 B1 ready<= ’1’;
28 B2 ready<= ’1’;
29 elsif start = ’1’ then
30 −− Set to ’0’ unless one cycle execution time
31 B1 ready<= B1 lc;
32 B2 ready<= B2 lc;
33 else
34 −− Set to ’1’ when block has finished
35 B1 ready<= B1 ready or B1 lc;
36 B2 ready<= B2 ready or B2 lc;
37 end if ; −−else reset / start / lc
38 end if ; −−clk
39 end process;
40 end architecture rtl ;
Figure 5.11: VHDL code of an entity that drives two blocks (statements or
loop nests) in parallel.
5.3 Implementation details 139
#define S1(i) \
A[i]= \
(B[i]+B[i-1]+C[i]) \
*B[i*2];
for(i=1;i<=N;i++){
S1(i)
}
-- #define S1(i)
i:=S1_arg_0;
B_ad<=i;
B_ad<=i-1;
a0:=B_q; C_ad<=i;
a1:=B_q; B_ad<=i*2;
a2:=C_q;
a3:=B_q;
a4:=(a0+a1+a2)*a3;
A_ad<=i;A_we<=’1’;A_d<=a4;
(a) C code (b) Statement execution steps
-- #define S1(i)
i:=S1_arg_0;
-- #if (i=1)
B_ad<=i; C_ad<=i;
B_ad<=i-1;
a0:=B_q;a2:=C_q;B_ad<=i*2;
a1:=B_q;
a3:=B_q; b0:=(a0+a1+a2);
a4:=b0*a3;
A_ad<=i;A_we<=’1’;A_d<=a4;
-- #else
B_ad<=i; C_ad<=i;
a1:=a0; B_ad<=i*2;
a0:=B_q; a2:=C_q;
a3:=B_q; b0:=(a0+a1+a2);
a4:=b0*a3;
A_ad<=i;A_we<=’1’;A_d<=a4;
Mem CMem B
b0
Mem A
a4
a0 a1 a2 a3
(c) After optimizations (d) Resulting data path
Figure 5.12: Example of optimizations on an intermediate file describing state-
ment execution steps. Each line corresponds to one clock cycle. Memory is
pipelined and a data element (X q) is read two cycles after the address (X ad)
assignment. Arrays B and C are put in a separate memory and can thus be
read in parallel. B[i−1] can be reused from the previous iteration if i > 1. The
calculation is split into two cycles, shortening the critical path. The number of
cycles is reduced from 9 to 7 (8 for i = 1).
Consider the code in Figure 5.12(a). A memory access time of 1 cy-
cle and a separate memory for each array is assumed (Figure 5.13). By
140 Hardware generation from the polyhedral model
data
control
statements
datapath + control
loop control
mem C
mem B
mem A
Figure 5.13: Architecture with separate memory for each array.
Table 5.3: Synthesis results of the design in Figure 5.12.
Pins LE DSP fmax
blocks (MHz)
Loop Control 48 152 0 153.56
Statements (b) 98 196 2 123.11
Statements (c) 98 186 2 212.45
Loop Control + Statements (b) 96 324 2 123.17
Loop Control + Statements (c) 96 323 2 140.53
translating the C syntax of the operations in the statement definition
in Figure 5.12(a) to a VHDL syntax, and writing the array accesses as
consecutive memory accesses the code in (b) is obtained. Here several
optimizations are possible. Memory B and C can be read in parallel
and the element B[i − 1] should not be fetched from memory after the
first iteration. The addition and multiplication are spread over two cy-
cles to shorten the critical path. This results in the code in (c) with cor-
responding data path in (d). Synthesis results for the different entities
and their compositions are shown in Table 5.3.
To synchronize the execution with external events the statements
“-- #waitdo” (execute next line as soon as condition becomes true)
and “-- #waitndo” (execute next line one cycle after condition be-
comes true) can be used. These wait statements will be used in Sec-
tion 5.6.3 to ensure that the needed data is available in the buffer mem-
ories (after a read transaction from the main memory to the on-chip
buffers) or data in buffers is not overwritten before it is written to the
main memory.
5.4 Simplification of loop bound expressions 141
1: for i = 0,
⌊
8N+39
8
⌋
2: for j = max
(
0,
⌈
−i−2
7
⌉
,
⌈
−i
3
⌉)
,
min
(⌊
i+7M−5
7
⌋
,
⌊
4M−1
4
⌋
,
⌊
i+3M−3
3
⌋)
3: . . .
(a)
1: for i = 0, N + 4
2: for j = 0,M − 1
3: . . .
(b)
Figure 5.14: Part of the code of RC HVs, generated with the CLooG option
“-f -1”, before (a) and after (b) algebraic simplifications of the loop bounds.
5.4 Simplification of loop bound expressions
As observed in Table 5.2 using the CLooG option “-f -1” reduces the
size of the generated software code. However, in several cases the gen-
erated hardware becomes slower or even larger when using this option.
This deserves a closer investigation.
The two outer most loops in RC HVs generated with the option
“-f -1” are shown in Figure 5.14(a). In the strip-mine transformations
that are used in the loop transformation process only factors that are a
power of 2 are used. Nevertheless, the generated code shows divisions
by 3 and 7. These divisions are also present in the generated VHDL
code, which causes the bad synthesis results.2
Fortunately, these divisions can be eliminated using the equations
listed in Table 5.4. We will demonstrate this on the example in Fig-
ure 5.14(a). Using (5.1) the upper bound on i can be simplified:
⌊
8N + 39
8
⌋
=
⌊
8N + 8 · 4
8
⌋
=
⌊
N + 4
1
⌋
= N + 4 .
Since i ≥ 0 we know that −i ≤ 0 and −i − 2 ≤ 0 and thus (5.5) can be
2Divisions by integers that are not a power of 2 used to be unsynthesizable. How-
ever, the QuartusII tool has an integer divider in its library, which is used during tech-
nology mapping.
142 Hardware generation from the polyhedral model
Table 5.4: Equations which can be used to simplify algebraic expressions.
Let a, d ∈ N, b, c ∈ Z, x, y ∈ Q then:
⌊
ab+ x
ad
⌋
=
⌊
ab+ ax′
ad
⌋
=
⌊
b+ x′
d
⌋
, with x′ =
⌊x
a
⌋
(5.1)
⌈
ab+ x
ad
⌉
=
⌈
ab+ ax′
ad
⌉
=
⌈
b+ x′
d
⌉
, with x′ =
⌈x
a
⌉
(5.2)
max (⌈x⌉ , ⌈y⌉) = ⌈max (x, y)⌉ (5.3)
min (⌊x⌋ , ⌊y⌋) = ⌊min (x, y)⌋ (5.4)
max
(
0,
⌈
ab− x
d
⌉)
= 0 ⇔ ab− x ≤ 0 (5.5)
⇔ ab ≤ x
⇔ b ≤ x′ , with x′ =
⌈x
a
⌉
(5.6)
⌈x
d
⌉
> c⇔ x > cd (5.7)⌊x
d
⌋
< c⇔ x < cd (5.8)
b < c⇔ b ≤ c+ 1 (5.9)
used to prove
max
(
0,
⌈−i− 2
7
⌉
,
⌈−i
3
⌉)
= 0 .
In a similar way, using i ≥ 0 and M ∈ Z, the upper bound on j can be
simplified:
min
(⌊
i+ 7M − 5
7
⌋
,
⌊
4M − 1
4
⌋
,
⌊
i+ 3M − 3
3
⌋)
= M +min
(⌊
i− 5
7
⌋
,
⌊−1
4
⌋
,
⌊
i− 3
3
⌋)
= M − 1 .
A similar but more complex case is found in the outer loops of RC
HV (Figure 5.15(a)). A study on the ranges of the different terms inside
max and min, using (5.7) and (5.8), leads to different cases depending
5.4 Simplification of loop bound expressions 143
1: for i = 0,
⌊
8N+63
32
⌋
2: for j = max
(
0,
⌈
60i−15N−44
68
⌉
,
⌈
512i−128N−975
16
⌉
,⌈
−32i−27
4
⌉
,
⌈
28i−7N−20
36
⌉)
,
min
(⌊
24i+5M+36
20
⌋
,
⌊
136i+31M+224
124
⌋
,
⌊
4M+47
16
⌋)
3: . . .
(a)
1: for i = 0,
⌊
N−1
4
⌋
+ 2
2: for j = (4i−N < 3 ? 0 : (4i−N > 7 ? 2 : 1)) ,(
i == 0 ? 2 +
⌊
M−1
4
⌋
: 3 +
⌊
M−1
4
⌋)
3: . . .
(b)
Figure 5.15: Part of the code of RC HV, generated with the CLooG option “-f
-1”, before (a) and after (b) algebraic simplifications of the loop bounds.
Table 5.5: The bad synthesis results in the block-based variants are improved
after algebraic simplification of the loop bounds in the code generated with
the option “-f -1” (c). (a) and (b) are taken from Table 5.2.
LE fmax(MHz)
(a) (b) (c) (a) (b) (c)
RC HV 2449 2533 1021 55.08 11.23 63.01
RC HVs 1768 1820 996 64.21 16.53 78.83
RC VH 2318 2515 1024 54.78 11.58 61.08
VH LB 6563 3218 1784 14.29 9.69 36.81
LB HV 8737 2003 1130 29.71 13.62 57.99
LB VH 21820 3195 2148 4.52 10.73 28.92
LB VHs 3869 2710 1781 13.45 11.62 33.85
on the value of i, three and two cases respectively, as shown in Fig-
ure 5.15(b).
Using the equations in Table 5.4. all variants listed in Table 5.2
can be simplified to only contain divisions that can be implemented
as shifts, i.e. divisions by powers of 2. The synthesis results after these
optimizations are shown in Table 5.5. Now the results are better than
those obtained with the option “-f 1”.
144 Hardware generation from the polyhedral model
The techniques presented in this section to simplify or eliminate in-
teger divisions are an extension to the techniques presented in [149].
That the divisions with dividers that are not a power of 2 can be elim-
inated corresponds to the fact that all strip-mine transformations were
done with factors that are a power of 2. Of course, in the general case it
is not always possible to eliminate such divisions using the techniques
presented here, e.g., when strip-mining is done with a factor 3. In
these and similar cases other methods, such as theMethod of Differences,
should be used [171, 149].
5.5 Optimizing the clock frequency
5.5.1 Loop control
We have mentioned two methods to increase the clock frequency of the
loop control entity: splitting a program into partitions each with their
own loop controller (Section 5.2.1) and inserting extra pipeline registers
between the automata (Section 5.3.1). One of the optimizations done
by synthesis tools is retiming [130, Ch.5], i.e. moving registers across
combinational blocks to minimize the clock cycle (or to minimize the
number of registers). Inserting registers not only splits one path into
two shorter parts, but also gives more retiming opportunities.
To test and compare the two options the example in Figure 5.16 is
considered. The program in (a) is split into four parts in (b), one top-
level loop and three subtree (ST) loop nests. The corresponding hard-
ware architectures are shown in Figure 5.17. The Greek symbols de-
note places where optional registers can be inserted. Synthesis results
are found in Table 5.6. In (a) and (b) no statements are connected to
the loop controller. As a result, 103 pins are used, the major part for
the statement arguments. In (c) and (d) the simple statements defined
in Figure 5.16(c) are connected to the controller and the number of pins
drops to 18. The unused statement arguments are then optimized away.
In some cases this results in higher clock speeds.
It is clear that inserting registers increases the reachable clock
speeds, although, not in all cases. Inserting registers at the highest
level (α) almost always results in a lower frequency. The presence of
extra registers here hinders some optimizations, as can be seen in the
number of logic elements, which raises more than the number of regis-
ters. γ clearly is the best location for optional registers. Therefore, the
5.5 Optimizing the clock frequency 145
Table 5.6: Synthesis results of the architectures in Figure 5.17 for a Stratix
EP1S25F1020C5. Figures without statements (a and b) and with statements
(c and d) for different choices of optional register insertion. (a) and (c) corre-
spond to Figure 5.17(a). (b) and (d) correspond to Figure 5.17(b). The registers
(last column) are contained in the LEs.
(a) (b)
Opt. fmax LE Regs
Regs (MHz)
None 59.79 286 42
α 57.39 297 46
β 69.18 295 47
γ 82.19 289 47
α+ β 68.55 303 51
α+ β + γ 106.12 304 56
α+ γ 74.86 298 51
β + γ 109.63 296 52
2γ 83.21 292 52
β + 2γ 109.10 299 57
Opt. fmax LE Regs
Regs (MHz)
None 68.34 338 65
α 66.45 350 69
β 71.77 348 72
γ 102.30 343 67
α+ β 72.46 360 76
α+ β + γ 104.30 362 78
α+ γ 93.77 355 71
β + γ 109.57 350 74
2γ 99.71 343 69
β + 2γ 107.28 350 76
(c) (d)
Opt. fmax LE Regs
Regs (MHz)
None 55.69 308 46
α 58.05 317 50
β 68.67 324 51
γ 77.36 317 51
α+ β 67.35 333 55
α+ β + γ 105.65 333 60
α+ γ 75.84 326 55
β + γ 109.94 325 56
2γ 78.76 320 56
β + 2γ 105.98 328 61
Opt. fmax LE Regs
Regs (MHz)
None 74.81 384 69
α 72.58 387 73
β 75.77 380 76
γ 116.08 383 71
α+ β 74.07 392 80
α+ β + γ 130.45 393 82
α+ γ 95.17 386 75
β + γ 118.04 381 78
2γ 114.14 383 73
β + 2γ 130.01 381 80
146 Hardware generation from the polyhedral model
for(i=0;i<=N;i++){
for(j=i;j<=N;j++){
S1(i,j) ;
}
for(j=0;j<=2*N-1;j++){
S2(i,j) ;
S3(i,j) ;
}
for(j=-N;j<=0;j++){
for(k=i-j;k<=2*N;k++){
S4(i,-j,k) ;
}
}
}
// Subtrees:
#define S1st(i) \
for(j=i;j<=N;j++){ \
S1(i,j) ; \
}
#define S2st(i) \
for(j=0;j<=2*N-1;j++){ \
S2(i,j) ; \
S3(i,j) ; \
}
#define S3st(i) \
for(j=-N;j<=0;j++){ \
for(k=-j+i;k<=2*N;k++){\
S4(i,-j,k) ; \
} \
}
// Top:
for(i=0;i<=N;i++){
S1st(i) ;
S2st(i) ;
S3st(i) ;
}
(a) (b)
-- #define S1(i,j)
i:=S1_arg_0; j:=S1_arg_1;
-- #if (j>=i+2)
S1_out <= ’1’;
-- #else
S1_out <= ’0’;
-- #define S2(i,j)
i:=S2_arg_0; j:=S2_arg_1;
-- #if (i>=j)
S2_out <= ’1’;
-- #else
S2_out <= ’0’;
-- #define S3(i,j)
i:=S3_arg_0; j:=S3_arg_1;
-- #if (i>=j)
S3_out <= ’0’;
-- #else
S3_out <= ’1’;
-- #define S4(i,j,k)
k:=S4_arg_2;
-- #if (k>=9)
S4_out <= ’0’;
-- #else
S4_out <= ’1’;
(c)
Figure 5.16: Program example written as a single loop nest (a) and split into
one top-level and three sub-level loop nests (b). Statements are defined as
Steps2Process input (c).
5.5 Optimizing the clock frequency 147
ID2
ID1
for 1
for 2
S4S2 S3S1
ID3
ID0
for 3
β
γ
α for 1
ID1’’’ID1’
ID2’’’
ID0’ ID0’’ ID0’’’
ID1
ID0
for 1’’
ID1’’
S1 S2 S3 S4
for 1’ for 1’’’
for 2’’’
α
β
γ
ST2ST1
ST3
Top
(a) (b)
Figure 5.17: Hardware architectures for the example in Figure 5.16 (a) and (b).
α, β and γ denote places where optional registers can be inserted.
148 Hardware generation from the polyhedral model
option to place double registers at that location was tried (2γ, β + 2γ).
Note that retiming is applied after placing the registers, which may
insert combinational logic between the registers of such a pair.
In the synthesis results of designs generated with CLooGVHDL we
observed that in the logic elements the utilization of LUTs is much
higher than the usage of registers. This means that registers may be
inserted without extra hardware cost and without the need to rerun
technology mapping tools. This makes it possible to integrate the op-
tional register insertion with the retiming optimizations done during
place and route. In this thesis the register insertion options are explored
exhaustively. After integration with retiming and place and route tools,
timing information becomes available that may be used to guide the in-
sertion process and avoid an exhaustive scan.
Composing the loop controller out of four subcontrollers increases
the area usage in a larger amount than placing registers, but also leads
to higher clock frequencies. The effect is larger with the statements
connected and depends heavily on the inserted registers.
The influence on the number of clock cycles is not shown here since
the relative impact depends on the value ofN , which can be chosen ar-
bitrarily in this artificial example. The number of cycles will be studied
in the IDWT case study (Table 5.10).
5.5.2 Data path
Since the user has more control over the construction of the data path
all classical methods can be used here. This has been demonstrated in
Figure 5.12 by spreading the computations over two cycles to shorten
the critical path.
5.6 Case study: exploration of IDWT variants
In this section hardware implementation variants of a 2-D IDWT are
generated and compared. A simple memory architecture assuming one
memory for each array and short fixed access times, as if it were on-
chip memories, are considered (Figure 5.13). In a final implementation
an external memorywill be needed as the used data sets are too large to
fit in the FPGAmemory blocks. Nevertheless, the performance of these
systems without memory hierarchy are a good measure for the perfor-
5.6 Case study: exploration of IDWT variants 149
mem
mem
mem
statements
datapath + control
data
control
fetch/store
control
cl
k 
1
cl
k 
2
loop control
main mem
Figure 5.18: Architecture with memory hierarchy. Fetching and storing data
is done in parallel with the processing. The execution speed is determined
by the slowest of these two processes. Multiple local buffers allow parallel
memory accesses.
HW HW
Manual intervention
schedule
optimizations
Hierarchy
Memory
Build
CLooGVHDL
Steps2process
statements.vhd
loop.vhd
steps.vhd
VIM−scripts
defines.h
loop.cloog
Figure 5.19: Design flow used to generate hardware implementation variants
of the 2-D IDWT.
mance of a final implementation with memory hierarchy as long as the
system is not bandwidth limited. By replacing the memories by dual-
port memories, using the other port to transfer data to and from the
external memory (Figure 5.18) such that at each time the needed data is
available in the on-chip memories, the execution times of the statement
invocations stay the same as without memory hierarchy (expressed as
the number of clock cycles since the clock frequencywill probably drop
when the design is extended and more FPGA area is used). Some extra
cycles will be needed for the first fetch operations and the last store op-
erations. All other data transfers may be done in parallel with the state-
ment executions. If the design is bandwidth limited the performance
will be assumed to be proportional with the available bandwidth.
150 Hardware generation from the polyhedral model
Table 5.7: Amount of data exchange (in pixels) to/from themain memory (left
side of Figure 5.18) for different IDWT variants. k=number of transformation
levels, R/C=number of rows/columns of image.
Variant Amount of data exchange Burst usage
k ∈ N k = 3
RC 16
3
RC(1− 1/4k) 5.25RC 50%
LB 8
3
RC(1− 1/4k) 2.625RC 100%
LB V 2RC 2RC 100%
5.6.1 Generating hardware variants
After some preprocessing steps (conversion to dynamic single assign-
ment to eliminate false dependences and unrolling of the outer loop to
remove the exponential expressions) the code can be represented in the
polyhedral model (Section 3.7).
URUK scripts (Section 3.5.2) were written for different loop trans-
formation sequences (Section 3.7 and 3.8), to generate IDWT variants:
untransformed (RC), fused (LB) and fused-and-tiled (LB V).3 How
the bandwidth requirements change by doing loop transformations is
shown in Table 5.7. Here, it is assumed that small data sets, in the
order of lines of an image, can be stored on-chip, but large sets, in the
order of a frame or a subband, have to be stored in external memory,
a.o., the input and final output (both with size RC). LB V reaches the
minimal possible bandwidth requirements, i.e. reading the input and
storing the output. The burst usage indicates the amount of data that
can be transferred in bursts. A row-major order is assumed. Only the
Row-Column-wise IDWT that alternately accesses the data in row and
in column wise has a bad burst mode usage. A bad burst mode usage
reduces the available bandwidth.
For each variant a hardware controller was generated by CLooG-
VHDL and combined with a statements entity, constructed using Vim
scripts [7, 133] and Steps2Process, with or without manual optimiza-
3The used input code is not exactly the same as in Chapter 3. The number of state-
ments differs. 1 statement in Chapter 3 corresponds with 2 consecutive statements in
the code used here. However, the transformation sequences are equivalent.
5.6 Case study: exploration of IDWT variants 151
Table 5.8: Implementations of the IDWT, without borders. Synthesis results
without memories (Figure 5.13 or the right side (clk 2) of Figure 5.18) ob-
tained with Altera QuartusII v6.1 for the Stratix EP1S25F1020C5 (S25 C5) and
EP1S60F1020C6 (S60 C6).
(a) Designs made with CLooGVHDL. Manual optimizations on the
statements were done in designs 4–6.
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (72×88) S25 C5 S60 C6 S25 C5 S60 C6
1 RC 3244 18 (9) 187389 56.37 46.20 19.05 15.61
2 LB 3390 18 (9) 200459 52.80 45.60 16.68 14.41
3 LB V 4031 18 (9) 200497 53.67 47.74 16.95 15.08
4 RC 2925 18 (9) 129821 57.67 56.60 28.13 27.61
5 LB 3091 18 (9) 142891 62.56 46.84 27.73 20.76
6 LB V 3155 18 (9) 142929 62.12 56.21 27.53 24.91
(b) Impulse C
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (72×88) S25 C5 S60 C6 S25 C5 S60 C6
7 RC 16087 80 (10) 605588 34.39 - 3.60 -
8 RC 8006 128 (16) 605588 - 32.93 - 3.44
(c) Designs generated by Handel-C. In designs 13 and 14 an EDIF netlist was
generated. In the other designs (9–12) VHDL was generated.
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (72×88) S25 C5 S60 C6 S25 C5 S60 C6
9 RC 7632 80 (10) 108792 47.02 - 27.37 -
10 RC 1730 144 (18) 108792 - 45.10 - 26.25
11 LB 15179 80 (10) 100356 44.87 - 28.32 -
12 LB 9296 144 (18) 100356 - 41.27 - 26.04
13 RC 3941 0 (0) 108792 70.96 64.30 41.31 37.43
14 LB 5762 0 (0) 100356 66.44 63.65 41.93 40.17
tions. The design flow is shown in Figure 5.19 and synthesis results of
all the designs are listed in Tables 5.8 and 5.9. The frame rates assume
a computation limited design.
To investigate the influence of the complexity due to the mirroring
at the borders, two versions of the code were made. The first does not
calculate the pixels near the borders. The second uses predicated state-
ments to distinguish between pixels at the center and near the border.
This distinction is done inside the statements (cf. Figure 5.12(c)) and
152 Hardware generation from the polyhedral model
Table 5.9: Implementations of the IDWT, with borders. Synthesis results ob-
tained with Altera QuartusII v6.1 for the Stratix EP1S25F1020C5 (S25 C5) and
EP1S60F1020C6 (S60 C6). (* = Number of cycles for CIF resolution instead of
72× 88 pixels)
(a) Designs made with CLooGVHDL. Manual optimizations on the
statements were done in design 18, which was extended with a memory
hierarchy to obtain design 19.
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (72×88) S25 C5 S60 C6 S25 C5 S60 C6
15 RC 7278 18 (9) 214269 54.71 50.17 16.17 14.83
16 LB 5589 18 (9) 215848 54.82 49.21 16.08 14.44
17 LB V 5743 18 (9) 215881 53.64 55.88 15.74 16.39
18 LB 10836 18 (9) 161037 50.12 43.40 19.71 17.07
19 LB 17350 18 (9) *934962 47.22 45.60 50.50 48.77
(b) Impulse C
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (72×88) S25 C5 S60 C6 S25 C5 S60 C6
20 RC 31288 144 (18) 697431 - 30.60 - 2.78
21 LB 23116 144 (18) 508116 - 38.01 - 4.74
(c) Designs generated by Handel-C. In designs 24 and 25 an EDIF netlist was
generated. In the other designs (22 and 23) VHDL was generated.
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (72×88) S25 C5 S60 C6 S25 C5 S60 C6
22 RC 45722 144 (18) 119712 - 30.81 - 16.30
23 LB 49705 144 (18) 119124 - 31.16 - 16.57
24 RC 11855 0 (0) 119712 53.17 49.90 28.13 26.40
25 LB 12303 0 (0) 119124 52.66 49.90 28.00 26.53
(d) Manual design without (design 26) and with (design 27) memory
hierarchy.
Design Var. LE DSP bl. Cycles fmax (MHz) Frames/s (CIF)
(#Mul) (CIF) S25 C5 S60 C6 S25 C5 S60 C6
26 RC 1823 10 (5) *869530 72.54 68.60 83.42 78.89
27 RC 2202 13 (8) *869530 72.54 68.60 83.42 78.89
not inside the loop control structure. This makes the loop transforma-
tions and control complexity similar for both cases and will only have
an impact on the statement implementations. Design 15 (Table 5.9(a))
is an exception, built from 80 statements instead of 16, and illustrates
5.6 Case study: exploration of IDWT variants 153
Table 5.10: Insertion of registers between control automata increases the clock
speed at the cost of extra clock cycles. Results for the line-based IDWT with
borders on a Stratix EP1S25F1020C5. α: buffer signals t 0, for 1 lc, start for 1
between ID0 and for 1. β: buffer signals t 0, t 1, for 2 lc, start for 2 between
ID1 and for 2. α + β corresponds to design 16 in Table 5.9. Number of cycles
and execution time T for transforming a frame of 72× 88 pixels over 3 levels.
Opt. Regs LE Regs Cycles fmax(MHz) T(ms)
None 5587 336 (= +0) 215366 (= +0) 53.42 4.03
α 5546 342 (= +6) 215388 (= +22) 57.20 3.77
β 5603 344 (= +8) 215826 (=+460) 53.99 4.00
α+ β 5589 350 (=+14) 215848 (=+482) 54.82 3.94
the area advantage of using fewer statements with predicates.
In Section 5.5 the option to insert registers between control au-
tomata has already been studied. Experiments on the line-based IDWT
(Table 5.10) show that inserting registers between high levels is more
beneficial than between low levels. Placing registers at both places re-
sults in a summation of the extra clock cycles and registers. However,
the clock frequency lies somewhere in between. The designs in Tables
5.8 and 5.9 all have registers at both places (cf. α+ β).
Figure 5.20(a) plots the frame rate as a function of the available
bandwidth. When sufficient bandwidth is available, a design is com-
putation limited and the frame rate does not depend on the available
bandwidth. For lower bandwidths a design is bandwidth limited and
the frame rate is proportional with the bandwidth. For different avail-
able bandwidths different designs have the highest performance. The
manual design of Section A.3 (design 26-27) reaches a much higher
computation limited frame rate, but suffers from its bad spatial and
temporal locality. Note that the available bandwidth depends on the
burst mode usage. For the RC-wise IDWT this is the harmonic mean of
the bandwidths of the horizontal and vertical accesses, and thus domi-
nated by the lowest of the two.
Manual optimizations
As can be seen in Figure 3.22 some elements of array A used on line 7
are already used on line 6 and similarly for array B on line 11 and 10.
These non-loop-carried reuses can be exploited by storing the reused
154 Hardware generation from the polyhedral model
 0
 10
 20
 30
 40
 50
 60
 70
 80
 0  20  40  60  80  100
CI
F 
Fr
am
es
/s
Bandwidth to external memory (MB/s)
Manual RC
CV LB + H + MEM
CV LB + H
CV LB V
CV RC
Manual RC
CV LB + H
CV LB + H + MEM
CV RC
CV LB V
(a) Manual and CLooGVHDL (CV), (H=hand-optimized statements)
 0
 10
 20
 30
 40
 50
 60
 70
 80
 0  20  40  60  80  100
CI
F 
Fr
am
es
/s
Bandwidth to external memory (MB/s)
ImpulseC
Handel-C VHDL
Handel-C EDIF
RC
LB
LB
Impulse C LB
Impulse C RC
Handel-C VHDL LB
Handel-C VHDL RC
Handel-C EDIF LB
Handel-C EDIF RC
(b) Impulse C and Celoxica Handel-C
Figure 5.20: Frame rate as a function of the available memory bandwidth.
Designs with borders on the Altera Stratix EP1S60F1020C6.
5.6 Case study: exploration of IDWT variants 155
elements in registers after the first read and using this value in the next
statement within that iteration. This reduces the number of memory
reads and thus the cycle count.
In the designs described above two memories are used, one for the
input frame consisting of the LL0, HL, LH and HH bands and one
for intermediate results (array B and frames LLl for l > 0). These two
memories can be accessed in parallel which is only done after optimiz-
ing the schedules of the statement implementations.
To shorten the critical path, the computation of the sum of products
is spread over several clock cycles, similar to the optimization of the
computation in Figure 5.12(c).
These three kinds of optimizations were done manually on the
statements of designs 4–6 and design 18.
Only when constructing a memory hierarchy (Section 5.6.3, design
19) the loop-carried reuses will be exploited and the number of external
memory reads is further reduced. This is done by introducing registers
to exploit reuses in the horizontal filtering, FIFOs to exploit reuses be-
tween vertical and horizontal filtering and line buffers to exploit reuses
within the vertical filtering.
5.6.2 Comparison of designs generated with CLooGVHDL,
Impulse C and Handel-C
The implementations made manually and with CLooGVHDL are com-
pared with designs generated by Impulse C (v. 1.22) [4], a commer-
cial tool for automatic synthesis of stream-based applications (Sec-
tion 2.2.2), and the Handel-C compiler (v. 3.5.768.63181) of Celoxica
DK Design Suite (v. 4.0 SP1) [3] (Section 2.2.4)4.
In each input language 18-bit data types are used in the datapath.
However, the VHDL code generated by Impulse C and Handel-C re-
sults in 32-bit multipliers (8 DSP blocks each), while with CLooGVHDL
18-bit multipliers (2 DSP blocks each) are used. It seems that the VHDL
code generators of Impulse C and Handel-C just neglect the data type
information specified in the input code. The Handel-C compiler can di-
rectly generate EDIF-netlists for a target FPGA instead of VHDL. In this
case no embeddedmultipliers but LEs (Logic Elements) are used to im-
plement the multiplication with constants. This has a large impact on
4The author would like to thank Tom Degryse for writing the Handel-C variants.
156 Hardware generation from the polyhedral model
the performance. Two FPGAs were targeted: the EP1S60F1020C6 has
144 DSP blocks, the EP1S25F1020C5 only 80 but it has a higher speed
grade. This results in a shift between used LEs and DSP blocks (designs
7–12).
The coding style of Handel-C has a large influence on the resulting
hardware. To force the reuse of multipliers a function had to be defined
that is called from different parts of the code, e.g., within the predicates
to distinguish between center and borders in designs 22–25. Also the
reuse of variables and the choice of putting directives for sequential
and parallel execution has a large impact and should be done with the
resulting hardware in mind. For each loop transformation this work
has to be redone, while with CLooGVHDL the statements block can be
reused, with the exception of transformation specific manual optimiza-
tions. The designs made with Handel-C have the lowest clock cycle
count. They exploit parallelism between different statements (and in-
vocations), which is not possible (until now) with CLooGVHDL.
The inclusion of the calculations near the borders clearly adds a lot
of complexity to the design. If a processor is available in the system it
may be beneficial to put the border calculations on the processor and
run only the regular non-border operations on the FPGA. However,
adding a processor only for processing the borders could be a large
overkill.
Impulse C gives the worst synthesis results. A single large automa-
ton is generated instead of a factorization into smaller automata, lead-
ing to a low clock frequency. Also the number of clock cycles is much
higher than with the other tools. The states of the control automaton
directly correspond to the macro blocks of the input code. Separate
hardware for each macro block is generated, which results in a bad area
efficiency.
5.6.3 Building a memory hierarchy
From the designs generated in Section 5.6.1 design 18 was selected for a
real hardware implementation.5 To this end, a memory hierarchy with
prefetch/store system was built (Figure 5.21, design 19) to replace the
simple construct of the two memories with fixed access time.
5The bandwidth requirements of RC are too high. LB V can only be used if the num-
ber of transformation levels k is fixed at design time. For the scalable video decoder
(Section A.3.1) the design was made for, k had to be a run-time parameter.
5.6 Case study: exploration of IDWT variants 157
Statements
Control
+ Data Path
Prefetch
Loop
Control
Main
Mem
Prefetch/Store
Requests
Clock Domain Boundary
Control
Data
B2
B1
B3
line buffers
line buffers FIFO
s
Requests
clk1
9x18
4x18
4x1872
128
72
4x18
1x18
72
clk2
I/O
I/O
Figure 5.21: Line-Based IDWT with memory hierarchy (design 19).
To exploit the improved data locality several buffers are inserted.
• Registers are used to exploit the loop-carried reuses in the hori-
zontal filtering. Register promotion techniques [51] can be used
to do this in a systematic way.
• Four FIFO buffers are inserted to transfer data from the vertical
to the horizontal wavelet transformation.
• Line buffers (B1, B2 and B3) are inserted between the main mem-
ory and computation elements.
• A block transfer system copies data from the main memories to
and from the buffers. A queue of prefetch and store requests is
kept in the Prefetch(/Store) Requests entities. A new fetch re-
quest is added each time space becomes available in the buffers
and not just before the data is needed. Therefore, if the system
158 Hardware generation from the polyhedral model
is not bandwidth limited, only in the beginning time is wasted
waiting for data. A block transfer is specified by the source and
target address and the amount of data to be copied (a power of
two times the line length in the smallest subband). A DMA con-
troller (Direct Memory Access) is used to drive the block (=burst)
transfers [78].
• Synchronization points (using “-- #waitndo”) are used to en-
sure that when a line in one of the buffers is accessed, all transfers
between that line and the main memory are finished.
• Some line buffers are split into parallel accessible buffers of one
line to increase the on-chip bandwidth. This results in a large
reduction in the number of clock cycles.
More information on the memory hierarchy construction process
with some considerations influencing implementation choices are
found in [71].
5.6.4 Performance, power and energy analysis
This design (19) and the manual design (27) were integrated on an Al-
tera PCI Development Board with a Stratix EP1S60F1020C6 FPGA and
256 MiB of DDR SDRAM memory, as part of the RESUME scalable
wavelet-based video decoder [70]. An Avalon switch fabric is used
to connect the hardware blocks to the external memory, running at 50
MHz.
The partially generated design transforms up to 53 (gray-scale) CIF
frames/s (higher than in Table 5.9, due to other tool settings). The man-
ually made design reaches only 11.3 CIF frames/s due to the double
amount of transfers to/from the externalmemory and the large amount
of transactions that cannot be done in bursts.
A line-based software implementation reaches 43.5 CIF frames/s on
an AMD Athlon XP 2500+ running at 1830 MHz. This corresponds to
42 million cycles for one CIF frame, 50 times more than the optimized
FPGA implementations. Note that on the FPGA different tasks can run
in parallel with the IDWT while this is not possible on the CPU.
Off-chip memory accesses may not only decrease the performance
but may also increase the energy dissipation. Table 5.11 illustrates this
by comparing the energy and mean power dissipation of two video
5.7 Discussion and conclusions 159
Table 5.11: Comparison of two decoders, each using one variant of the IDWT.
Sequence of 10 GOPs (161 CIF color frames YUV 4:2:0) of Foreman at full qual-
ity.
Variant Time (s) E (J) Pmean (W)
RC IDWT (Manual) 40 25.4 0.64
LB IDWT (CLooGVHDL + ManOpt) 10 11.6 1.16
decoders, each using one of the two IDWT variants mentioned above.
More information on the measurement setup and detailed results can
be found in [70, 81].
5.7 Discussion and conclusions
In this chapter a novel hardware architecture is presented that fits the
polyhedral model. Its modular composition allows to make trade-offs
between area and clock speed and supports extensions towards paral-
lel execution. Integrating the application of loop transformations and
hardware generation makes design space exploration more feasible.
Experiments with two variants of the IDWT showed that a semi-
automatically generated (over-sized) implementation can outperform a
manual design in speed and energy dissipation if bandwidth is a limit-
ing factor. Obviously, doing loop transformations at the start of a man-
ual design process would have led to better results. However, in this
case doing a search space exploration would have been much harder.
We have shown that loop transformations not only influence the
data locality but also the control complexity. In some cases optimiza-
tions on the loop bound expressions generated by CLooG are possible
and needed. For this we have presented techniques that are an exten-
sion of existing loop bound simplification methods.
Is it a good choice to divorce the loop control from the statement
implementations? An argument pro is the orthogonal development of
both entities, which allows to combine variants of both and explore the
design space when the generation of one of both is not fully automated
yet. An argument contra is that no combined optimizations are pos-
sible. Certainly in the case considered until now, sequential execution
of statement invocations, this is a large drawback. Only the parallelism
160 Hardware generation from the polyhedral model
within statement invocations can be exploited (cf. the lower cycle count
reached with Handel-C). This only gives good results for applications
with large statements. Hence, extending the presented work to include
parallelism among statement invocations is a must.
An option not considered here is to use one of the C-based hard-
ware description languages, described in Chapter 2, as output. A C to
Handel-C converter has already been described in [134]. Since a lot of
user directives are needed and coding style has a large influence on the
final result, it is not sure if this would have been much easier.
A restriction of many high-level synthesis tools is that they can deal
well with themacros and corresponding hardware constructs they offer
in their library, but the inclusion of user made blocks has to be specified
as communication with an external block. Such a block then has to
fit a certain interface or the communication should be described at a
lower level. An ideal high-level synthesis tool should offer the option
to extend the vendor library with user constructs. Such new constructs
should then be specified as
• a C-like construct and behavior,
• a VHDL or hard core implementation,
• schedule or synchronization information,
such that it can be optimally integrated in the schedule process.
Vim scripts, although primitive by nature, offer this extendability in
a simple way. The path to a first implementation is not much smoother,
but the scripts work as a kind of memory for manual optimizations.
They make iterations more feasible where tool support is lacking.
Synthesis tools that try to optimize the clock speed are conservative
in their approach. Retiming and other techniques like duplication of
registers will never change the behavior of a circuit. Here, behavior is
defined in a strict sense including clock cycle count. In the proposed
factorized architecture the insertion of extra registers at certain places is
allowed, even though this changes the cycle count. This could be used
in synthesis tools to extend the current clock optimization methods.
In this chapter we have presented methods to more easily explore
the design space by shortening the path to a certain point (implemen-
tation) in that space. Besides this, methods are needed that aid in pre-
dicting the results of design choices, such that less points in the design
space really have to be visited.
Chapter 6
Conclusions
and future work
In this chapter the conclusions drawn from this dissertation are summarized.
Directions for future research based on the contributions of this work are high-
lighted.
6.1 Conclusions
Many methods and tools have emerged that automate part of the tasks
done by a hardware designer. Nowadays, register-transfer-level speci-
fications are considered as synthesizable. High-level synthesis tools try
to further raise the abstraction level of hardware descriptions. Typi-
cally, to have good results, the user still has to be aware of the architec-
tural choices made by such tools. However, if we want to fully benefit
from the computational power that future technologies will offer, there
is no alternative to raising the abstraction level to avoid a design bot-
tleneck.
A high computational power is useless if time is wasted waiting
for data transfers. Loop transformations are indispensable to tackle the
memory bottleneck, both for software and hardware implementations.
The polyhedral model offers a perfect means to automate the applica-
tion of loop transformations. However, the choice of which transfor-
mations to apply is much less straightforward to automate. Two trends
exist. The first tries to iterate over a large number of possible transfor-
mations, and evaluates the result of each transformation to guide the
162 Conclusions and future work
further exploration. Heuristics are needed to prune the search space to
be scanned. The second uses directives delivered by the user. These
may be pragmas or scripts composed of loop transformation steps. In
the context of the latter, this dissertation presented the option to allevi-
ate the task of constructing long transformation sequences by combin-
ing primary (sub)sequences.
Bounds on (piecewise) (quasi-)polynomials over discrete domains
are needed to statically evaluate properties of a program obtained after
loop transformations. To this end, several novel methods to find such
bounds are presented in Chapter 4 and compared with existing meth-
ods. The presented methods take advantage of the fact that for large
domains or small degrees the continuous-domain extrema of polyno-
mials are a good approximation of the discrete-domain extrema. For
small domains the straightforward method of evaluating the (quasi-)
polynomial in each point is still faster, but this solution does not scale
for larger domains. We have proved that with a simple selection mech-
anism, which introduces almost no overhead, hybrid methods can
be constructed that combine the strengths of different methods. With
more complex selectionmechanisms less additional benefit is expected.
Experiments demonstrate that the techniques can be used to evaluate
properties of algorithm implementations with different schedules and
thus can be used to guide the loop transformation process.
The class of applications that benefit from implementation on a
FPGA corresponds well with the class of applications that can be de-
scribed within the polyhedral model. Since loop transformations not
only influence the data access pattern but also the control complex-
ity, integration of loop transformations and hardware generation is
needed to speed up the design space exploration. Therefore, we have
presented a novel hardware architecture with close correspondence to
the polyhedral representation. The architecture consists of a loop con-
troller entity, composed of different automata, and a statements entity.
With CLooGVHDL a hardware description of such a loop controller is
generated directly after applying loop transformations. The factorized
implementation offers several options for area-speed trade-offs, allows
independent optimization of both entities, and supports the extension
towards a parallel implementation. The IDWT case study has proved
that the integration of loop transformations and hardware generation
indeed eases the design space exploration.
6.2 Directions for future research 163
6.2 Directions for future research
Every answer to a question raises ten new questions to answer. As
a consequence, the conclusions have to be followed by new research
questions.
A lot ofwork is still neededbefore loop transformationswill be fully
automated. It is not clear if the solution will exist of a set of transforma-
tions applied in sequence or of more abstract methods that try to find
the optimal schedule functions. Trying to automate the combination of
primary sequences already seems a challenging task.
The arise of divisions with a high implementation cost in code gen-
erated from the polyhedral model points to the fact that the current
code generation tools are not optimal yet. In this thesis some methods
to simplify loop bounds were presented and applied manually. Au-
tomation of these methods and integration with code generators is not
done yet. However, probably it is possible to avoid the generation of
such expressions (at least in cases similar to those that have arisen in
this thesis) by adapting the way in which the code is generated, instead
of adding a postprocessing step. This would be a cleaner solution.
The proposed methods to find bounds on quasi-polynomials only
aid in evaluating a design property. A next step is the integration in
methods to guide a search space exploration.
Bernstein expansion offers the possibility to deal with parametric
polynomials. This option has not been used yet in the context of the
presentedwork. For nonparametric problems Bernstein expansionmay
lead to large overestimates. Subdivision of the domains may improve
the accuracy but does not look trivial for arbitrary polyhedral domains.
Alternatives for the Bernstein expansion may also be considered.
The tools described in Chapter 5 are only a small step towards fully
automated hardware generation. Parallelism between statement invo-
cations has to be supported. For the construction of the data path a
real input parser and scheduler are needed, together with methods to
construct a memory hierarchy that supports the available parallelism
at the lower levels. The impact of the choice to split loop control from
data path has to be investigated further. Possibly, hybrid architectures
may be constructed.
Retiming methods may be extended to allow the insertion of reg-
isters at some predefined places. This would relieve the constraint of
preserving clock cycle true behavior.
164 Conclusions and future work
When using a high-level synthesis tool, interfaces are mostly re-
stricted to those offered by the vendor, or have to be implemented at
a lower level, which removes the advantage of using such tools. The
development of an extendible high-level synthesis tool would be an
option. In such a tool user built constructs and vendor library blocks
would be dealt with on equal terms.
A lot of the optionsmentioned above try to shorten the time to reach
a certain point in the design space. However, complementary methods
are needed that estimate the impact of design choices without actual
implementation. As a result, fewer points in the design spaces would
have to be visited.
Appendix A
The discrete wavelet
transform and its inverse
This appendix provides more details on the Discrete Wavelet Transform
(DWT) and the Inverse Discrete Wavelet Transform (IDWT). The focus is on
the implementation aspects. For more information on where, why and how
wavelets are used the reader is referred to [48] and [121].
A.1 The 1-D discrete wavelet transform
A.1.1 Filters
A filter F is a linear time invariant operator, which is completely deter-
mined by its impulse response {fk ∈ R | k ∈ Z}. The output signal b
of a filter is obtained as the convolution of the input signal a and the
impulse response f :
bn = (f ∗ a)n =
ke∑
k=kb
an−kfk , (A.1)
with kb and ke the smallest and largest integer number for which fk is
non-zero1. F is causal if kb ≥ 0. The filter is a Finite Impulse Response
(FIR) filter in case only a finite number of filter coefficients fk is non-
zero. The Z-transform of a FIR filter F is a Laurent polynomial F (z)
1kb may be smaller than zero (non-causal filter). This avoids the use of an arbitrary
time shift (factor zi) in Equation A.5 and A.6. In a physical implementation all filters
are made causal by a time shift of the impulse response.
166 The discrete wavelet transform and its inverse
2
2H˜
G˜
G˜
H˜
2
2
X FIR filter X 2 Remove odd samples
L0
L1
L2
H2
H1
. . .
Figure A.1: Block diagram of the 1-D Discrete Wavelet Transform (1-D DWT).
given by
F (z) =
ke∑
k=kb
fkz
−k . (A.2)
The notations F, F and F (z) are often used to denote both the filter and
the associated polynomial.
If l denotes the length of a filter (l = ke − kb + 1), then the compu-
tation of one element of the filter output requires l multiplications and
l − 1 additions. We will further assume kb = 0 and ke = l − 1. For a
symmetric causal filter (fk = fl−1−k, k = 0, 1, . . . , l−1) formula A.1 can
be rewritten as
bn =
l−1∑
k=0
an−kfk (A.3)
=


∑ l
2
−1
k=0 (an−k + an−l+1+k)fk if l is even
∑ l−1
2
−1
k=0 (an−k + an−l+1+k)fk + an− l−1
2
f l−1
2
if l is odd
(A.4)
which uses only
⌈
l
2
⌉
multiplications and l − 1 additions. For antisym-
metric filters (fk = −fl−1−k) the same number of operations applies.
Symmetric and antisymmetric filters are linear phase filters.
A.1.2 The convolution-based 1-D discrete wavelet transform
The general block diagram of a forward wavelet transform is shown
in Figure A.1. The input signal, L0, is sent through two FIR filters, a
low-pass filter H˜ and a high-pass filter G˜, called the analysis filters. The
A.1 The 1-D discrete wavelet transform 167
G
H
2
2
2
2
H
G
2
H1
L1
L2
H2
L0
. . .
Adder Insert one zero after each sample
Figure A.2: Block diagram of the 1-D Inverse DiscreteWavelet Transform (1-D
IDWT).
results are subsampled by a factor two (removing the odd samples), re-
sulting in the signals,H1 and L1, the first level subbands of the wavelet
transform. This process is repeated on the L subband of each level un-
til the kth level is reached. The Inverse Discrete Wavelet Transform
(IDWT) uses the synthesis filters H and G, preceded by upsampling
with a factor two (inserting a zero after each sample) (Figure A.2). The
conditions (necessary and sufficient) for perfect reconstruction of the
original input signal are given by
H(z)H˜(z−1) +G(z)G˜(z−1) = 2 (A.5)
H(z)H˜(−z−1) +G(z)G˜(−z−1) = 0 . (A.6)
The symbols A, a or s (approximation, signal) and D or d (detail, dif-
ference) are sometimes used instead of L and H . These names refer to
an important property of the wavelet subbands. The L subbands form
a low-resolution approximation of the original signal. This approxima-
tion can be made more accurate by adding information from the detail
signals.
Example 12. The Haar Wavelet is the simplest example of a wavelet trans-
form. The analysis filters are given by
H˜(z) =
1
2
+
1
2
z−1
G˜(z) = −1 + z−1 .
Let L0 = (a0, a1, a2, . . .), then
L1 =
(
a1 + a0
2
,
a3 + a2
2
, . . .
)
168 The discrete wavelet transform and its inverse
z−1
2
2
Odd
Samples
Even
Samples
G˜e
G˜o
H˜e
H˜o
L0
L1
H1
Figure A.3: Polyphase decomposition of the analysis filters to avoid the com-
putation of results, removed by subsampling.
Go Samples
Even
Ge
He
Ho Samples
Odd
L0H1
L0L1
Figure A.4: Polyphase decomposition of the synthesis filters to avoid multi-
plications with zero.
and
H1 = (a1 − a0, a3 − a2, . . .)
are simply the means and the differences of each pair of input samples. It is
easy to see that L0 can be reconstructed using the synthesis filters
H(z) = 1 + z−1
G(z) = −1
2
+
1
2
z−1 .
A straightforward implementation of the diagram in Figure A.1 and
A.2 leads to superfluous operations. Half of the filtering results are
thrown away by subsampling, and half of the multiplications in the
filters after upsampling are multiplications with zero. A filter can be
A.1 The 1-D discrete wavelet transform 169
P U
P
Split
Samples
Even
Samples
Odd
P H1
L1
L0
One lifting step
+
−
+
+
Figure A.5: Lifting scheme of the DWT with one lifting step.
decomposed into an even and an odd phase as follows:
F (z) = Fe(z
2) + z−1Fo(z
2) . (A.7)
Using the even and odd phases of the wavelet filters and splitting the
signals into even and odd samples, the number of operations can be
reduced. This results in the diagrams of Figure A.3 and A.4.
For the Haar filter (Example 12) this becomes:
H˜e(z) = H˜o(z) =
1
2
, G˜e(z) = −1 , G˜o(z) = 1
He(z) = Ho(z) = 1 , Ge(z) = −1
2
, Go(z) =
1
2
.
A.1.3 The lifting scheme
The number of operations needed to perform a DWT can in many cases
be reduced by reusing subexpressions. This is done in the lifting scheme.
A simple instance is shown in Figure A.5. First, the even and odd sam-
ples are split, as in the left side of Figure A.3. This splitting is called the
lazy wavelet transform. If the input signal has a local correlation struc-
ture, the even and odd subsets will be highly correlated. Given one
of the two sets, it should be possible to predict the other one with rea-
sonable accuracy. The odd samples are predicted by filtering the even
sampleswith a predict filterP. In the case of theHaar waveletP (z) = 1.
The L-subbands have the property that their average is the same as for
the original signal (
∑
k hk = H(1) = 1). This is ensured by the update
stepwith the update filterU. For theHaar waveletU(z) = 12 . For longer
wavelet filters the lifting scheme consists of multiple lifting steps. A de-
tailed explanation of the lifting scheme can be found in [60, 154]. Note
170 The discrete wavelet transform and its inverse
Samples
Even
Samples
Odd
Merge
P
U P
PH1
L1
L0
One lifting step
+
+
−
+
Figure A.6: Lifting scheme of the IDWT with one lifting step.
that the decomposition of a wavelet transform into lifting steps is not
unique.
A lifting scheme can always be inverted. It suffices to undo the
update and predict steps and merge the even and odd samples (Fig-
ure A.6).
A.1.4 Computational complexity
For the convolution-based DWT the number of operations is easily de-
rived by counting the operations of the FIR filters as shown in Ta-
ble A.1. The notations lG and lH are used for the lengths of the (syn-
thesis) wavelet filters. After adding tildes, the equivalent expressions
in lG˜ and lH˜ for the analysis filters are obtained. For Daubechies filters
lG + lH ∈ 4Z. Therefore, the filters have both even or both odd length.
A.2 The 2-D discrete wavelet transform
The 2-dimensional DWT (Figure A.7) consists of consecutively apply-
ing the 1-D DWT in the horizontal and in the vertical direction. First,
each row of the original image (LL0) is transformed horizontally result-
ing in the subbands L1 and H1. Next, this result is filtered vertically in
a similar way to generate the four subbands LL1, LH1,HL1 and HH1.
The same steps are repeated recursively on each LL-subband until level
k. This variant of the DWT is called level-by-level, row-column-wise
(RC-wise). In a similar way the DWT can be defined for higher dimen-
sions [112, 146]. Analogously, the inverse transformation performs a
1-D IDWT along the columns and the rows of the different subbands.
A
.2
Th
e
2
-D
dis
c
rete
w
a
v
elettra
n
sfo
rm
171
Table A.1: Number of operations of the 1-D DWT or IDWT for each 2 samples. The formulas assume lifting steps of degree
one, i.e., predict and update filters with length two. In particular cases, the numbers can differ slightly as shown for the (9,7)
filter pair. lH and lG denote the lengths of the analysis or synthesis wavelet filters. Formulas and numbers based on [60, 106].
Convolution-based Lifting-based
Mul Add Mul Add
FIR Filter General l l − 1(Anti-)symmetric ⌈ l2⌉
Daubechies
Wavelets
General lH + lG
lH + lG − 2
⌈
lH+1
2
⌉
+
⌈
lG+1
2
⌉
+ 3 ⌈
lH+1
2
⌉
+
⌈
lG+1
2
⌉
+ 1
Odd, Linear phase lH+lG2 + 1
lH+lG
4 + 1
Even, Linear phase lH+lG2
⌈
lH+1
2
⌉
+
⌈
lG+1
2
⌉
+ 3
Haar 1 2 1 2
Daubechies (9,7) 9 14 6 8
172 The discrete wavelet transform and its inverse
LL0 H1L1
LH1 HH1
HL1LL1
DWT
IDWT
DWT
IDWT
Hor. xf Vert. xf
LL1 H2L2
LH2 HH2
HL2LL2
DWT
IDWT
DWT
IDWT
Hor. xf Vert. xf
LH1 HH1
HL1LH2 HH2
HL2LL2
(a) (b)
Figure A.7: Graphical representation of the 2-D (I)DWT over 2 levels. The
2-D DWT first transforms the original image horizontally, line by line. The
gray box in LL0 represents the pixels that are read to calculate the elements
indicated by the gray boxes in L1 and H1. The arrows indicate the scanning
direction and their sizes indicate the scanning rate. This rate differs by a factor
2 due to the subsampling. Next, the results are filtered vertically, column by
column as indicated by the black boxes. These two steps are repeated on the
LL1 subband. The resulting subbands can be stored together within one data
set with the same dimensions as the original frame (b). The 2-D IDWT consists
of doing the inverse of each step.
Figure A.8 shows the DWT applied on the Lena image, a standard
reference image for image processing. To remove the negative values
in the LH , HL and HH subbands the absolute value is taken. Vertical
lines are visible in the HL subbands and horizontal lines in the LH
subbands corresponding with spatial high-frequent components in the
horizontal and vertical direction, respectively.
Code for the 2-D inverse transformation in the specific case of a (9,7)
filter pair (lG = 9, lH = 7) is shown in Figure 3.22. Corresponding to
Figure A.4, the code in Figure 3.22 contains lines for the even and lines
for the odd elements.
Some indices get values outside the image boundaries. A way to
extend the image is needed. This can be done by adding zeros, mir-
A.2 The 2-D discrete wavelet transform 173
(a) (b)
Figure A.8: The Lena reference image before (a) and after (b) the discrete
wavelet transform over 2 levels. In (b) the absolute value of the wavelet coef-
ficients is used.
roring the pixels near the border or extending the image with a linear
combination of the pixels near the border [113].
A.2.1 Computational complexity
The number of operations needed to transform an imagewith sizeR×C
over k levels is given by
#oper.
frame
=
k∑
i=1
1
4i
(
AR
C
2
+AC
R
2
)
= ARC
1− (14 )k
3
4
, (A.8)
with A the number of operations in the 1-D DWT needed for each two
samples, depending on the filter lengths and the kind of operation (Ta-
ble A.1).
When the (I)DWT is included in a multimedia system and has to
transform a stream of images, the required number of operations per
second is calculated by multiplication with the framerate, i.e. the num-
ber of frames per second. If color images represented in the YUV 4:2:0
format are used a factor 3/2 is added since the pixel luminance (Y
frames) is given at full resolution while the chrominance information
(U and V frames) is only available at a quarter of the resolution (1 +
174 The discrete wavelet transform and its inverse
1/4 + 1/4 = 3/2). This results in
#oper.
s
=
3
2
#oper.
frame
× framerate . (A.9)
A.3 Manual design of a RC-wise IDWT
A.3.1 Specifications
Part of the work presented in this thesis has been donewithin the scope
of the RESUME project (Reconfigurable Embedded Systems for Use in
scalable Multimedia Environments, 2003–2006, IWT grant 020174, [9]).
An overview of the project achievements is found in [80].
This project investigated, a.o., the usefulness of reconfigurable
hardware (FPGAs) for decoding scalable video. Scalable video means
that the Quality of Service (QoS), i.e. the image quality, frame rate,
resolution and color depth of the decoded video, can be freely adapted
without having to re-encode the video stream or having to decode the
whole video stream if only a lower quality version is required. As a
result, the video server only has to encode and store a single video
stream for all different kinds of decoding platforms which can have
varying computational power, network bandwidth or screen types.
Within the RESUMEproject a scalable wavelet-based video decoder
has been built with the requirements to decode a video with CIF reso-
lution (288× 352 pixels, YUV 4:2:0 format) in real-time, i.e. 30 frames/s
(originally, later lowered to 25 frames/s). An overview of the used
codec algorithm is found in [77, 80]. This led to the following design
specifications for the IDWT: an IDWT with a 9/7 filter pair, able to
transform 45 gray-scale CIF frames/s over 3 levels (more or less equiv-
alent to 30 color frames/s using YUV 4:2:0). Within these constraints
minimal area is pursued.
The target platform was an Altera PCI development board with a
Stratix S25F1020C5 FPGA, which contains roughly 25000 LEs and 80
DSP blocks (9-bit equivalents).
A.3.2 Modular design concept
To make the design reusable on different HW platforms a modular de-
sign concept is used (Figure A.9). The assumption is made that the used
A.3 Manual design of a RC-wise IDWT 175
modules
IDWT
v. 2
I/O
specific
Application
protocol
modules
IDWT
v. 1
I/O
Protocol 2
ReuseAdapt
Protocol 1
HW Platform 1
HW Platform 2
Figure A.9: Using separate I/O-modules makes it easy to put a design on
different platforms with different interfaces.
HWplatform (or a potential future platform) offers a way to plug-in IP-
cores or other HWmodules when they satisfy a given interface (IF). Ex-
amples are the open standard AMBA R© protocol [17] and the Avalon R©
Memory-Mapped interface used by the Altera SOPC builder [15].
This interface is platform dependent. Therefore, a core has to be
adjusted to the platform it is plugged into. With the introduction of
platform specific I/O-modules that translate a platform independent,
but application specific protocol into the platform specific protocol, the
needed adjustments are localized and the functional modules, here the
IDWT, can be reused without changes.
More information on the used design flow is found in [69].
A.3.3 Architecture
The architecture is shown in Figure A.10. Some small control ports and
parameter inputs, such as the start and ready signal, R, C , k and the
filter coefficients, are not shown.
From the design specifications the required number of operations
per second is estimated using formula A.9 and is slightly less than 54
million multiplications and 84 million additions per second. Assum-
ing a clock frequency of 50 MHz (proved feasible for other designs)
this should be reachable with two multipliers. In the final design five
multipliers are used as explained below.
176 The discrete wavelet transform and its inverse
RAMB
RAMO
RAMI
M
a
in
M
e
m
1D-ctrl
2D-ctrlI/O-ctrl
Data
Control
Clock Domain Boundary
FIR1 FIR2
1D-IDWT
cl
k
id
w
t
cl
k
dd
r
2D-IDWT
Figure A.10: Architecture of the manually designed IDWT.
The used 9/7 filter pair leads to the following filter lengths:
lG = 9, lH = 7 ,
lGe = 5, lGo = 4, lHe = 4, lHo = 3 .
Two FIR filters (HW modules) are used. FIR1 has length 4 and imple-
ments Go and He, successively. FIR2 has length 5 and implements Ge
and Ho (extended with a 0 in front and at the end to have length 5).
Both FIR filters are symmetric and thus take 2 and 3 coefficients, re-
spectively. The filter coefficients can be altered at run-time. In a first
execution phase, the L samples are filtered with He and Ho and the re-
sults are stored in RAMB. In a second phase, theH samples are filtered
with Ge and Go and the results are added to the results stored in RAMB.
The filter coefficients of G and H are thus loaded alternately into the
two FIR filters. This scheme (a 1-D IDWT) is repeated for every column
and then for every row and this for all transformation levels.
The in- and output buffers, RAMI and RAMO, can store 2 rows or
columns of a frame or subband (Figure A.11(a)). These buffers are two-
port memories. Therefore, transactions to the main memory can take
place in parallel with accesses of the calculation units. After the first
A.4 Image quality 177
0
1
RAMO
L
L
H
H
0
1
RAMI
O0
O1
I0
I1
I2
I3
C0
C1
C2
.
.
.
.
.
.
.
.
.
(a) (b)
Figure A.11: I/O buffers and partial ordering of memory transactions (I and
O) and calculations (C).
column (row) (numbered 0) is fetched into line 0 of RAMI (transaction
I0), the computations can start (C0) while the second column (row) is
fetched into line 1 (I1). The next column (row) can only be fetched into
line 0 (I2), if this line does no longer contain needed data, i.e., if C0 is
finished. For the output a similar dependence exists. This leads to a
partial ordering as shown in Figure A.11(b).
To implement the mirroring at the borders, needed to have enough
input samples as mentioned on page 173, the address counters first
count down a few samples, then up from the beginning to the end of
the buffered column (row) and then a few samples down again.
Synthesis results are found in Table 5.9, design 26 (without I/O
modules) and design 27 (with I/O modules for direct connection to
a single memory with a one-cycle access time).
A.4 Image quality
If data typeswith infinite accuracy are used the result of doing an IDWT
after a DWT is the original image. However, when the word length
is limited differences may arise which lower the quality of the recon-
structed image. A very common quantity to measure image quality is
the PSNR (Peak Signal to Noise Ratio).
178 The discrete wavelet transform and its inverse
A.4.1 PSNR
For an image withR×C pixels, denoted as aij , the Signal toNoise Ratio
(SNR) is expressed as
SNR = 10 log
(
1
RC
∑
R,C a
2
ij
1
RC
∑
R,C(aij − bij)2
)
(dB) , (A.10)
with bij the pixels of the original (error-free) image. The denominator
is the mean quadratic error and the numerator the mean signal energy.
This is a measure for the relative error. In most cases a measure for
the absolute noise value or error is needed. If the signal value is re-
placed by its maximal possible value (peak), 255 if the range is [0, 255]
or [−255, 255], then the definition of PSNR is obtained:
PSNR = 10 log
(
2552
1
RC
∑
R,C(aij − bij)2
)
(dB) . (A.11)
The PSNR is an objective measure for image quality but does not tell
anything about the subjective quality as perceived by a human ob-
server.
A.4.2 From floating-point to fixed-point
The wavelet transform used in this dissertation was originally written
in software using floating point data types. For a hardware implemen-
tation a conversion to fixed-point was needed. Using 18 bit for the
wavelet coefficients (4 decimals) resulted in perfect reconstruction.
A.4.3 Quality scalability
One of the QoS scalability axes of the RESUME codec is the number
of bits used to represent wavelet coefficients. This allows to make a
trade-off between the computation power needed, the used network
bandwidth and the image quality, expressed as PSNR.
A.5 Related work on DWT variants
Most IDWT variants discussed in Section 3.8 are similar to hardware
architectures reported in the literature. Often different names for the
A.5 Related work on DWT variants 179
Table A.2: An overview of DWT variants found in the literature
Variant Specification Reference
RC [106, 16, 116]
LB level-by-level [116]
multi-level [54, 167, 106, 53, 16, 116]
BB BBRC [167, 16, 116]
Stripe-based LB V (overlap) [106]
Others Pseudo-fractal [115]
same variant are found. In [106] an overlapped and non-overlapped
stripe-based scanmethod are proposed. This corresponds to an (I)DWT
tiled along one dimension. Line-based and block-based architectures
have been the subject of intensive research [54, 167, 106, 53, 16, 116].
The reports illustrate how the buffer size, throughput, latency, energy,
. . . are determined by the data access pattern and thus by what is called
above variant.
The usefulness of a variant does also depend on other blocks in the
system. It is usually beneficial to adjust the data production and con-
sumption pattern of successive modules to each other. By doing this
Lafruit et al. [115] create a 2-D pseudo-fractal space-filling scan curve.
This schedule is not linear and cannot be reached by doing loop trans-
formations. This illustrates one of the limitations of the polyhedral ap-
proach.
To our knowledge none of the articles cited above use high-level
loop transformation tools in their implementation path. There exist
tools [73, 151] that apply a limited set of loop transformations to explore
the area-speed trade-off curve of hardware implementations. However,
these transformations work at a lower level and can only explore dif-
ferent implementations of one variant.
180 The discrete wavelet transform and its inverse
Appendix B
Reuse distances of IDWT
variants
This appendix contains the results of the reuse distance measurements of the
IDWT variants created in Section 3.8 and listed in Table 3.6.
SLO [39] (Section 3.2.2) was used for the reuse distance measure-
ments. By default SLO measures the reference distance, the number
of accesses between two accesses of a reuse pair, instead of the num-
ber of different elements accessed.1 Therefore, the environment vari-
able RDLIB_MEASURE_TRUE_REUSE_DISTANCE was set. The mea-
surements are performedwhile transforming an image of 72×88 pixels
over 3 (= k) levels.
By accumulating the reuse distances larger than a certain cache size
one gets the number of misses of a cache with that size using the LRU
replacement policy. This is done to generate Table B.1, from which Fig-
ure 3.26 is derived.
Note that SLO counts multiple references for multidimensional ar-
rays. E.g., the access to A[i][j], is counted as a read from A[i]
(pointer to A[i][0]) followed by the access to A[i][j] itself, leading
to a count of two accesses. This can be avoided by mapping multidi-
mensional arrays to one-dimensional arrays.
1Sometimes the term stack distance is used to denote the true reuse distance, while
reuse distance is then used to denote the reference distance [38].
182 Reuse distances of IDWT variants
Table B.1: Cache misses based on reuse distances of IDWT variants measured
by SLO, expressed as the number of misses assuming the LRU cache policy
for a cache with size 2i − 1.
i RC RC H RC HV RC V RC VH VH LB
0 931998 931998 931998 931998 931998 931998
1 852582 852262 849903 851486 847388 854033
2 726595 722092 721674 721320 728181 730556
3 291168 288294 287064 290335 284615 284972
4 230320 233585 233728 232979 233377 233468
5 26613 29010 32336 29935 32911 50745
6 26298 28706 29602 27187 30176 42393
7 25671 28110 25423 22903 26069 41241
8 22579 25103 14992 12322 14965 36362
9 10202 12731 14213 11845 14267 15660
10 9690 12079 12991 11032 13006 12595
11 8954 10981 10695 9106 10801 10876
12 7060 9222 7152 6913 7236 6758
13 4868 6035 2114 2103 2120 2008
14 489 0 0 0 0 0
i LB LB H LB HV LB V LB VH
0 931998 931998 931998 931998 931998
1 853202 848196 846701 851340 853636
2 727883 725639 728090 725780 725317
3 291010 287006 285132 291080 288962
4 238930 232089 234891 239041 231493
5 90660 67602 66120 90650 64154
6 38244 43059 43230 38262 41407
7 37907 41176 41506 37955 40493
8 36895 36064 37134 36982 35944
9 33646 14871 16617 34242 16685
10 25102 14409 14521 26942 15571
11 1881 13124 12456 4678 11899
12 1642 8812 8831 2754 5571
13 1119 6508 6748 144 774
14 211 0 0 0 0
Appendix C
Bernstein expansion
C.1 Parametric Bernstein expansion
In the experiments in Chapter 4, Bernstein expansion is used to find
bounds on polynomials over continuous domains. A tutorial on how
Bernstein expansion works and what it is used for is given by [88].
Also [56] contains a clear explanation with examples. In this paper
Clauss et al. have shown how to extend Bernstein expansion to arbi-
trary linearly parameterized polytopes. In this section parametric Bern-
stein expansion is explained with the notation and naming conventions
used in this thesis and illustrated on a small example.
Given a parameterized polytope
Pp = {x ∈ Qn | Ax ≥ Bp+ c } ,
with p ∈ Zm, and a (parametric) polynomial
fp(x) ∈ (Q[p])[x]
of degree d (in the variables x)1, an upper bound for
F c(p) = max
x∈Pp
fp(x)
is computed as follows.
1Q[x] is the set of the polynomials in x over Q. (Q[p])[x] thus means the set of the
polynomials in x with coefficients that are polynomials in p.
184 Bernstein expansion
Table C.1: Polyhedral chambers of the parameterized polytope defined in ex-
ample 3.
Chamber N Parametric vertices
p ∈ [0, 5] 5 (0, 0), (0, 6), (5− p, 6), (5, 6− p), (5, 0)
p ∈ [5, 6] 4 (0, 0), (0, 11− p), (5, 6− p), (5, 0)
p ∈ [6, 11] 3 (0, 0), (0, 11− p), (11− p, 0)
• Recall that the parametric vertices of a parameterized poly-
tope may only exist for a subset of the parameter values (Sec-
tion 3.3.3) [119]. Therefore, the parameter domain, D = {p | Pp 6=
∅}, is divided into chambers such that within each chamber C , Pp
has a fixed set ofN parametric vertices vi(p). Note thatN may be
different in different chambers.
For the polytope defined in example 3 on page 36, there are three
chambers as listed in Table C.1.
• In each chamber, the polynomial is written in terms of the bary-
centric coordinates of the polytope (x =
∑
i αivi(p), 0 ≤ αi ≤ 1,∑
i αi = 1),
f˜p(α) = fp
(∑
i
αivi(p)
)
.
For example, in the third chamber listed in Table C.1, i.e. p ∈ [6, 11],
one can write
(x, y) = α1(0, 0) + α2(0, 11 − p) + α3(11 − p, 0)
= (11− p)(α3, α2)
with
∑
i αi = 1 and 0 ≤ αi ≤ 1. In terms of these barycentric
coordinates the polynomial fp(x, y) = pxy becomes
f˜p(α) = p(11 − p)2α2α3 .
• This polynomial is rewritten in terms of generalized Bernstein ba-
sis polynomials
f˜p(α) =
∑
k1,k2,...,kN≥0
k1+k2+···+kN=d
bdk(p)B
d
k(α)
C.1 Parametric Bernstein expansion 185
where the Bernstein polynomials are defined as
Bdk(α) =
(
d
k
)
αk =
(
d
k1, k2, . . . , kN
)
αk11 α
k2
2 · · ·αkNN ,
with (
d
k1, k2, . . . , kN
)
=
d!
k1!k2! . . . kN !
the multinomial coefficients.
For the example above this becomes
f˜p(α) = p(11 − p)2α2α3 = p(11 − p)
2
2
B2(0,1,1) ,
with
B2(0,1,1)(α) =
(
2
0, 1, 1
)
α2α3 = 2α2α3 ,
and thus
b2(0,1,1) =
p(11− p)2
2
b2(2,0,0) = b
2
(0,2,0) = b
2
(0,0,2) = b
2
(1,1,0) = b
2
(1,0,1) = 0 .
• Since
1 = (α1 + α2 + · · · + αN )d
=
∑
k1,k2,...,kN≥0
k1+k2+···+kN=d
(
d
k1, k2, . . . , kN
)
αk11 α
k2
2 · · ·αkNN
=
∑
k1,k2,...,kN≥0
k1+k2+···+kN=d
Bdk(α) ,
the Bernstein polynomials sum to one for all α.2 From this and
the fact that they are nonnegative for 0 ≤ αi ≤ 1, the minimal
2This can also be derived from the fact that the Bernstein polynomials can be found
in the probability mass function of the multinomial distribution. In probability theory,
the multinomial distribution is a generalization of the binomial distribution. Themulti-
nomial distribution is a discrete distribution which gives the probability of choosing a
given collection of n items from a set of k items with repetitions and the probabilities
of each choice given by p1, . . . , pk. These probabilities are the parameters of the multi-
186 Bernstein expansion
and maximal resulting Bernstein coefficients bdk(p) bound the poly-
nomial, i.e.
min
ki≥0P
ki=d
bdk(p) ≤ fp(x) ≤ max
ki≥0P
ki=d
bdk(p)
for all p ∈ C , x ∈ Pp.
For the working example we get
0 ≤ fp(x) ≤ p(11 − p)
2
2
for all p ∈ [6, 11]. The exact upper bound is p(11−p)24 . In this ex-
ample the Bernstein expansion gives an overestimate of a factor
2.
• Redundant Bernstein coefficients, i.e. those that are (clearly)
never larger (max) or smaller (min) than any other Bernstein
coefficient, are removed.
C.2 Incremental Bernstein expansion
In an incremental Bernstein expansion the Bernstein coefficients are cal-
culated in a recursive way. If part of the variables are considered as pa-
rameters the Bernstein coefficients obtained will be functions of these
variables and can be expanded in their turn.
For example, we consider all but one variable to be extra parameters
and compute the parametric vertices of this parametric interval with
the corresponding chamber decomposition. In each of the chambers,
we compute the Bernstein coefficients. Note that the single variable that
was not considered to be a parameter in this computation no longer
appears in these Bernstein coefficients. The above procedure is then
nomial distribution. The probability mass function of the multinomial distribution is:
f(x1, . . . , xk; n, p1, . . . , pk) = Pr(X1 = x1 ∧ · · · ∧Xk = xk)
=
8><
>>:
n!
x1! · · ·xk!
px1
1
· · · p
xk
k = B
n
x (p) , when
Pk
i=1 xi = n
0 otherwise ,
for non-negative integers x1, . . . , xk. Since all values of a probability mass function lie
in the interval [0, 1] and sum to one, the Bernstein polynomials Bnx (p) for a given n
have a range in [0, 1] and sum to one when 0 ≤ pi ≤ 1 and
P
i
pi = 1
C.2 Incremental Bernstein expansion 187
applied recursively on the Bernstein coefficients until all variables have
been eliminated.
We will demonstrate this on the example used in the previous sec-
tion with p ∈ [6, 11].
• In the first step y is considered as a parameter, which results in a
one-dimensional function over a one-dimensional parameterized
polytope:
fp,y(x) = pyx
D = {(p, y) ∈ Q2 | p ∈ [6, 11], y ∈ [0, 11 − p]}
Pp,y = {(x) ∈ Q1 | x ∈ [0, 11 − p− y]} .
The parametric vertices are (0) and (11 − p− y) for all parameter
values.3 Now, Bernstein expansion is done:
(x) = α1(0) + α2(11 − p− y)
= α2(11 − p− y)
B1(1,0)(α) = α1 B
1
(0,1)(α) = α2
pyx = pyα2(11 − p− y) = py(11− p− y)B1(0,1)(α) .
The Bernstein coefficients are
b1(1,0) = 0
b1(0,1) = py(11− p− y) .
• In a second step y is considered as a variable and Bernstein expan-
sion is done on the Bernstein coefficients obtained in the previous
step. The first coefficient is 0 and does not have to be expanded
anymore. For the other coefficient we get:
fp(y) = py(11 − p− y)
D = {(p) ∈ Q1 | p ∈ [6, 11]}
Pp = {(y) ∈ Q1 | y ∈ [0, 11 − p]} .
3If p was not restricted to [6,11], there would have been two chambers:
C1 = {(p, y) ∈ Q
2 | 0 ≤ y ≤ 6, 0 ≤ p ≤ 6− y}
with parametric vertices (0) and (5) and
C2 = {(p, y) ∈ Q
2 | 0 ≤ y ≤ 6, 6− y ≤ p ≤ 11− y}
with parametric vertices (0) and (11− p− y).
188 Bernstein expansion
The parametric vertices are (0) and (11− p) for all parameter val-
ues. No division into chambers is needed.
(y) = α1(0) + α2(11− p)
= α2(11− p)
B2(2,0)(α) = α
2
1 B
2
(0,2)(α) = α
2
2 B
2
(1,1)(α) = 2α1α2
fp(y) = py(11− p− y)
= pα2(11 − p)(11− p− α2(11 − p))
= p(11− p)2α2(1− α2)
= p(11− p)2α2α1
=
p(11− p)2
2
B2(1,1)(α)
The Bernstein coefficients are
b2(2,0) = 0
b2(0,2) = 0
b2(1,1) =
p(11− p)2
2
,
which leads to the same bounds as in the previous section.
The motivation for considering an incremental computation is that
in the direct computation, the number of Bernstein coefficients is equal
to the number of multinomial coefficients
(
d
k1,k2,...,kN
)
, which is
(
d+N − 1
d
)
>
Nd
d!
,
where d is the degree of the polynomial andN is the number of vertices.
In the incremental computation, all parametric polytopes are inter-
vals. They therefore have 2 vertices and require only d + 1 Bernstein
coefficients. On the other hand, the interval Bernstein expansion has to
be performed in all nodes of a tree with 2n levels, where n is the num-
ber of variables, which branches alternately over the chambers and the
Bernstein coefficients. Furthermore, the final number of chambers may
be larger than the number of chambers obtained through a direct appli-
cation of Bernstein expansion. Note that an advantage of this possibly
C.2 Incremental Bernstein expansion 189
increased number of chambers is that the results may be more accu-
rate. Preliminary experiments have shown that the incremental com-
putation performs better than the direct computation, mainly because
the removal of redundant Bernstein coefficients can be applied at each
level, greatly reducing the total number of Bernstein coefficients con-
sidered.
Incremental Bernstein was thought up and implemented in the
barvinok library by Sven Verdoolaege.
190 Bernstein expansion
Appendix D
MRE of 3rd degree
polynomials
An upper bound on the relative difference (RE ) between continuous-domain
and discrete-domain extrema of polynomials is found in the proof of Theorem
3 in Section 4.5.1. For second degree polynomials an exact bound is given in
(4.16) or Theorem 5. This appendix determines an exact bound (MRE ) for
third degree polynomials (listed in Table 4.1(a)).
D.1 Problem formulation
Let the RE of a polynomial be defined as
RE =


(F c − F c)− (F d − F d)
F d − F d
when F d 6= F d ,
0 when F d = F d ∧ F c = F c ,
∞ when F d = F d ∧ F c 6= F c ,
with F c and F c the extrema of this polynomial in the continuous inter-
val [0, N ], N ∈ N (N ≥ 3), and F d and F d the maximum and minimum
in the discrete subdomain {0, 1, . . . ,N} = Z∩[0,N ]. Find a third degree
polynomial for which the RE becomes maximal.
192 MRE of 3rd degree polynomials
D.2 Solution
Consider the general third degree polynomial
f(x) = ax3 + bx2 + cx+ d . (D.1)
Without loss of generality we assume a > 0.
The derivative f ′(x) = 3ax2 + 2bx + c has discriminant D = 4b2 −
12ac. If b2 − 3ac < 0 the polynomial has no local extrema and RE =
0. When b2 − 3ac = 0 there is one point of inflection with horizontal
tangent and thus also RE = 0. For b2 − 3ac > 0 there are two local
extrema
xl =
−b−√b2 − 3ac
3a
xr =
−b+√b2 − 3ac
3a
.
Note that the point
(
−b
3a , f
(
−b
3a
))
is a point of symmetry of the polyno-
mial function.
If none of the extrema lies inside the interval [0,N ] RE will be 0.
Now, three cases are left, based on the locations of the local extrema:
Case 1: 0 ≤ xl < xr ≤ N
The extrema can be written as:
F c = max(f(xl), f(N))
F c = min(f(0), f(xr))
F d = max(f(⌊xl⌋), f(⌈xl⌉), f(N))
F d = min(f(0), f(⌊xr⌋), f(⌈xr⌉)) .
We will prove that the maximal RE is obtained when
f(⌊xl⌋) = f(⌈xl⌉) = F d
f(⌊xr⌋) = f(⌈xr⌉) = F d .
Since the RE is scale and position invariant we can assume f(xl) =
1 and f(xr) = −1. The polynomial is then fully determined by xl and
xr. We first ignore the influence of f(0) and f(N).
D.2 Solution 193
Lemma 2. Given f(xl) = 1, f(xr) = −1, xr = A + B and xl = A − B,
then
REr ,
(1−max(f(⌊xl⌋), f(⌈xl⌉)))− (−1−min(f(⌊xr⌋), f(⌈xr⌉)))
2
can only reach a local maximum in the (A,B)-space when 2A ∈ Z.
Proof. Note that A = −b3a , with a and b defined in (D.1), and (A, f(A))
is the symmetry point of f(x). From f(xl) = −f(xr) it follows that
f(A) = 0.
Assume 2A /∈ Z. Then f(⌊xl⌋) 6= f(⌈xl⌉) or f(⌊xr⌋) 6= f(⌈xr⌉).
When f(⌊xl⌋) = f(⌈xl⌉) and f(⌊xr⌋) = f(⌈xr⌉) there would exist i, j ∈
Z, i 6= j for which f(i) = f(i+ 1) and f(j) = f(j + 1) and thus
ai3 + bi2 + ci+ d = a(i+ 1)3 + b(i+ 1)2 + c(i+ 1) + d
0 = a(3i2 + 3i+ 1) + 2bi+ b+ c
0 = a(3j2 + 3j + 1) + 2bj + b+ c
0 = 3a(i2 − j2 + i− j) + 2b(i− j) .
Since i, j ∈ Z and i 6= j this results in
3a(i+ j + 1) + 2b = 0
2A =
−2b
3a
= i+ j + 1 ∈ Z .
1. f(⌊xl⌋) 6= f(⌈xl⌉) and f(⌊xr⌋) 6= f(⌈xr⌉).
There are four subcases:
(a) f(⌊xl⌋) > f(⌈xl⌉), f(⌊xr⌋) > f(⌈xr⌉).
If the tangents in ⌊xl⌋ and ⌈xr⌉ are not parallel then a hori-
zontal shift of the curve away from the point with the steep-
est tangent increases REr since the value of one discrete ex-
tremum then varies faster than the other (Figure D.1(a)). If
the tangents are parallel, i.e. f ′(⌊xl⌋) = f ′(⌈xr⌉) there exist
i, j ∈ Z, i 6= j with
f ′(i) = f ′(j)
3ai2 + 2bi+ c = 3aj2 + 2bj + c
3a(i+ j) + 2b = 0
2A =
−2b
3a
= i+ j ∈ Z .
194 MRE of 3rd degree polynomials
x
f(x)
⌊xl⌋ ⌈xr⌉⌈xl⌉ ⌊xr⌋ x
f(x)
⌊xl⌋⌈xl⌉ ⌊xr⌋⌈xr⌉
(a) (b)
Figure D.1: Examples illustrating the cases 1a and 1b of the relative positions
of f(⌊xl⌋), f(⌈xl⌉), f(⌊xr⌋) and f(⌈xr⌉). A horizontal arrowdenotes a horizon-
tal shift of the curve. The resulting vertical shifts of the values in the discrete
points are indicated with vertical arrows.
(b) f(⌊xl⌋) > f(⌈xl⌉), f(⌊xr⌋) < f(⌈xr⌉).
Shifting the curve to the right increasesREr. (Figure D.1(b)).
(c) f(⌊xl⌋) < f(⌈xl⌉), f(⌊xr⌋) > f(⌈xr⌉).
Analogous to 1b.
(d) f(⌊xl⌋) < f(⌈xl⌉), f(⌊xr⌋) < f(⌈xr⌉).
Analogous to 1a.
2. f(⌊xl⌋) = f(⌈xl⌉) or f(⌊xr⌋) = f(⌈xr⌉) There our four subcases
similar to case 1. In some cases it may be needed that B and A
vary together such that f(⌊xl⌋) = f(⌈xl⌉) or f(⌊xr⌋) = f(⌈xr⌉),
respectively. The detailed elaboration is omitted for brevity.
Lemma 3. If 2A ∈ Z then REr reaches a local maximum when f(⌊xl⌋) =
f(⌈xl⌉) and f(⌊xr⌋) = f(⌈xr⌉).
Proof. When varying B, max(f(⌊xl⌋), f(⌈xl⌉)) reaches a minimum
when f(⌊xl⌋) = f(⌈xl⌉) and min(f(⌊xr⌋), f(⌈xr⌉)) reaches a maxi-
mum when f(⌊xr⌋) = f(⌈xr⌉). Since 2A ∈ Z both equalities are always
satisfied or not at the same time.
Lemma 4. RE can only reach a local maximum when
f(0) ≥ min(f(⌊xr⌋), f(⌈xr⌉))
f(N) ≤ max(f(⌊xl⌋), f(⌈xl⌉)) .
D.2 Solution 195
Proof. If f(0) < min(f(⌊xr⌋), f(⌈xr⌉)) or f(N) > max(f(⌊xl⌋), f(⌈xl⌉))
RE can be increased by varying A and B appropriately. The details are
omitted for brevity.
Corollary 1. If 0 ≤ xl < xr ≤ N the maximal RE is reached when
F d = f(⌊xl⌋) = f(⌈xl⌉) ≥ f(N)
F d = f(⌊xr⌋) = f(⌈xr⌉) ≤ f(0) ,
and B is minimal.
Indeed, when f(xl) = 1 = −f(xr) and f(⌊xl⌋) = f(⌈xl⌉), a smaller
value of B leads to a higher value of f(xl)− f(⌊xl⌋) = f(xl)− f(⌈xl⌉).
This leads to the solutions listed at the end of Section 4.5.1. The
corresponding MREs are printed in Table 4.1(a).
Case 2: xl < 0 ≤ xr ≤ N
For a given polynomial it is always possible to construct a polynomial
with an equal or higher RE for which 0 ≤ xl < xr ≤ N . For brevity,
this is not proved here.
Case 3: 0 ≤ xl ≤ N < xr
Thanks to the point symmetry of the polynomial around (− b3a , f(− b3a))
this case can be converted into case 2.
196 MRE of 3rd degree polynomials
Appendix E
Bounds on
quasi-polynomials with the
barvinok library
E.1 Command line interface
The bernstein library used for Bernstein expansion is included in the
barvinok library [160]. Most of the methods described in Section 4.6
are implemented in the barvinok library and available through the
command line application barvinok maximize. This program reads
a piecewise (quasi-)polynomial from standard input and converts it to
a (set of) polynomial(s) on which Bernstein expansion is applied to find
upper and lower bounds. Following command line options are used in
the experiments:
Add Var = Default option
Split Periods --split=Threshold
Without the option --minimize, the maximum is output. Mod
Classes and Exact are not yet included in the library, at the time of writ-
ing, but implemented using the library.
The input format corresponds to the output of the enumeration ap-
plications, such as barvinok enumerate e. Here, the following com-
mand line options are used to have a piecewise polynomial as output.
198 Bounds on quasi-polynomials with the barvinok library
Drop Frac --polynomial-approximation=upper
--approximation-method=drop
Poly Approx --polynomial-approximation=upper
--approximation-method=scale
--scale-options=narrow
A command line could look like:
barvinok_enumerate < polytope.in \
| barvinok_maximize --split=10 --minimize
E.2 Library call
When iterating over a large number of schedule functions, using
the command line interface would terribly slow down the execution.
Therefore, the counting problems are generated using the internal data
types of the barvinok library and the enumeration and maximization
functions are called directly from within the C++ program that iterates
over the schedules.
Appendix F
Example of code generated
by CLooGVHDL
This appendix lists the code generated by CLooGVHDL corresponding
to the C code listed in Figure 5.5. The different blocks of the control
entity can be distinguished in the code:
for 1 line 103–147
for 2 line 149–193
ID 0 line 195–249
ID 1 line 251–314
ID 2 line 316–379
The rest of the code contains glue logic, statement argument assign-
ments (line 391–432) and a process that generates the value of the ready
output port (line 436–447).
1 −− Generated from filter comp . cloog by (VHDL)CLooG v0.12.2 64 bits in
0.024001s.
2 LIBRARY ieee;
3 LIBRARY work;
4 −−library
5 USEwork.test constants.all;
6 USE ieee.std logic 1164.all ;
7 USEwork.cloog functions.all;
8
9 entity filter comp is
10 port(
11 clk : in std logic ;
12 reset : in std logic ;
13 start : in std logic ;
200 Example of code generated by CLooGVHDL
14 lc : out std logic ;
15 ready : out std logic ;
16 S1 arg 0 : out integer range −512 to 511;
17 S1 arg 1 : out integer range −512 to 511;
18 S1 arg 2 : out integer range −512 to 511;
19 S1 arg 3 : out integer range −512 to 511;
20 S1 arg 4 : out integer range −512 to 511;
21 S1 arg 5 : out integer range −512 to 511;
22 S1 lc : in std logic ;
23 S2 arg 0 : out integer range −512 to 511;
24 S2 arg 1 : out integer range −512 to 511;
25 S2 arg 2 : out integer range −512 to 511;
26 S2 arg 3 : out integer range −512 to 511;
27 S2 arg 4 : out integer range −512 to 511;
28 S2 arg 5 : out integer range −512 to 511;
29 S2 lc : in std logic ;
30 start S1 : out std logic ;
31 start S2 : out std logic ;
32 cols : in integer range −512 to 511;
33 rows : in integer range −512 to 511
34 ) ;
35 end filter comp;
36 architecture cloog gen of filter comp is
37 signal S1 lc g : std logic ;
38 signal S2 lc g : std logic ;
39 signal for 1 ev : integer range −512 to 511;
40 signal for 1 lc : std logic ;
41 signal for 1 lc g : std logic ;
42 signal for 1 lc g unbuf : std logic ;
43 signal for 1 li : std logic ;
44 signal for 1 running : std logic ;
45 signal for 1 sv : integer range −512 to 511;
46 signal for 2 ev : integer range −512 to 511;
47 signal for 2 lc : std logic ;
48 signal for 2 lc g : std logic ;
49 signal for 2 lc g unbuf : std logic ;
50 signal for 2 li : std logic ;
51 signal for 2 running : std logic ;
52 signal for 2 sv : integer range −512 to 511;
53 signal guard S1 : std logic ;
54 signal guard S2 : std logic ;
55 signal guard for 1 : std logic ;
56 signal guard for 2 : std logic ;
57 signal i 1 ext , i 1 int , inc 1 : integer range −512 to 511;
58 signal i 2 ext , i 2 int , inc 2 : integer range −512 to 511;
Example of code generated by CLooGVHDL 201
59 signal p3 : integer range −512 to 511;
60 signal p5 : integer range −512 to 511;
61 signal seq 0 lc : std logic ;
62 signal seq 0 running : std logic ;
63 signal seq 1 lc : std logic ;
64 signal seq 1 running : std logic ;
65 signal seq 2 lc : std logic ;
66 signal seq 2 running : std logic ;
67 signal start S1 g : std logic ;
68 signal start S1 s : std logic ;
69 signal start S1 t2 : std logic ;
70 signal start S1 t2 int : std logic ;
71 signal start S2 g : std logic ;
72 signal start S2 s : std logic ;
73 signal start S2 t2 : std logic ;
74 signal start S2 t2 int : std logic ;
75 signal start for 1 : std logic ;
76 signal start for 1 g : std logic ;
77 signal start for 1 int : std logic ;
78 signal start for 2 : std logic ;
79 signal start for 2 g : std logic ;
80 signal start for 2 int : std logic ;
81 signal start inner for 1 , start inner for 2 : std logic ;
82 signal start seq 0 , start seq 1 , start seq 2 : std logic ;
83 signal t 0 , t 0 l0 , t 0 l1 , t 0 l2 : integer range 0 to 1;
84 signal t 1 , t 1 l1 , t 1 l2 : integer range 0 to 1;
85 signal t 2 , t 2 l2 : integer range 0 to 1;
86 begin
87 start seq 0 <= start ;
88 lc <= seq 0 lc;
89
90 t 0 l0 <= t 0;
91 t 0 l1 <= t 0 l0 ;
92 t 0 l2 <= t 0 l1 ;
93 t 1 l1 <= t 1;
94 t 1 l2 <= t 1 l1 ;
95 t 2 l2 <= t 2;
96 start S1 g <= start S1 t2 ;
97 start S1 s <= start S1 g when guard S1 =’1’ else ’0’;
98 start S1 <= start S1 s ;
99 start S2 g <= start S2 t2 ;
100 start S2 s <= start S2 g when guard S2 =’1’ else ’0’;
101 start S2 <= start S2 s ;
102
103 start seq 1 <= start for 1 or start inner for 1 ;
202 Example of code generated by CLooGVHDL
104 for 1 li <= ’1’ when ((i 1 ext+inc 1)> for 1 ev ) else ’0’;
105 for 1 lc <= for 1 li and seq 1 lc ;
106 for 1 lc g unbuf <= for 1 lc when guard for 1=’1’ else
start for 1 g ;
107 for 1 lc g <= for 1 lc g unbuf;
108 i 1 ext <= for 1 sv when ((start for 1 =’1’) and (for 1 running=’0’)
) else i 1 int ;
109
110 for 1 : process(clk)
111 begin
112 if (clk=’1’ and clk’event) then
113 if ( reset = ’1’) then
114 start inner for 1 <= ’0’;
115 for 1 running <= ’0’;
116 else −−reset
117 if for 1 running=’1’ then
118 if seq 1 lc =’1’ then
119 i 1 int <= i 1 int+inc 1 ;
120 if for 1 li =’1’ then
121 for 1 running <= ’0’;
122 start inner for 1 <= ’0’;
123 else −−not last iter
124 start inner for 1 <= ’1’;
125 end if ;
126 else −−seq lc
127 start inner for 1 <= ’0’;
128 end if ;−−seq lc
129 else −−for running
130 if start for 1 =’1’ then
131 if seq 1 lc = ’0’ then
132 i 1 int <= for 1 sv;
133 else
134 i 1 int <= for 1 sv + inc 1 ;
135 end if ;
136 if for 1 lc g unbuf = ’0’ then
137 for 1 running <= ’1’;
138 start inner for 1 <= seq 1 lc;
139 else
140 for 1 running <= ’0’;
141 start inner for 1 <= ’0’;
142 end if ;
143 end if ;
144 end if ; −−for running
145 end if ; −−else reset
146 end if ; −−clk
Example of code generated by CLooGVHDL 203
147 end process;
148
149 start seq 2 <= start for 2 or start inner for 2 ;
150 for 2 li <= ’1’ when ((i 2 ext+inc 2)> for 2 ev ) else ’0’;
151 for 2 lc <= for 2 li and seq 2 lc ;
152 for 2 lc g unbuf <= for 2 lc when guard for 2=’1’ else
start for 2 g ;
153 for 2 lc g <= for 2 lc g unbuf;
154 i 2 ext <= for 2 sv when ((start for 2 =’1’) and (for 2 running=’0’)
) else i 2 int ;
155
156 for 2 : process(clk)
157 begin
158 if (clk=’1’ and clk’event) then
159 if ( reset = ’1’) then
160 start inner for 2 <= ’0’;
161 for 2 running <= ’0’;
162 else −−reset
163 if for 2 running=’1’ then
164 if seq 2 lc =’1’ then
165 i 2 int <= i 2 int+inc 2 ;
166 if for 2 li =’1’ then
167 for 2 running <= ’0’;
168 start inner for 2 <= ’0’;
169 else −−not last iter
170 start inner for 2 <= ’1’;
171 end if ;
172 else −−seq lc
173 start inner for 2 <= ’0’;
174 end if ;−−seq lc
175 else −−for running
176 if start for 2 =’1’ then
177 if seq 2 lc = ’0’ then
178 i 2 int <= for 2 sv;
179 else
180 i 2 int <= for 2 sv + inc 2 ;
181 end if ;
182 if for 2 lc g unbuf = ’0’ then
183 for 2 running <= ’1’;
184 start inner for 2 <= seq 2 lc;
185 else
186 for 2 running <= ’0’;
187 start inner for 2 <= ’0’;
188 end if ;
189 end if ;
204 Example of code generated by CLooGVHDL
190 end if ; −−for running
191 end if ; −−else reset
192 end if ; −−clk
193 end process;
194
195 −− sequence 0
196 seq 0 lc <= ’1’ when ((t 0=1)and (for 1 lc g = ’1’) ) else ’0’;
197
198 seq 0 running p: process(clk)
199 begin
200 if (clk=’1’ and clk’event) then
201 if ( reset = ’1’) then
202 seq 0 running <= ’0’;
203 elsif seq 0 lc =’1’ then
204 seq 0 running <= ’0’;
205 elsif start seq 0 =’1’ then
206 seq 0 running <= ’1’;
207 end if ; −−else reset
208 end if ; −−clk
209 end process;
210
211 seq 0: process(clk)
212 begin
213 if (clk=’1’ and clk’event) then
214 start for 1 int <= ’0’;
215 if ( reset = ’1’) then
216 t 0 <= 0;
217 else −−reset
218 if (seq 0 running =’1’) or ( start seq 0 =’1’) then
219 case t 0 l0 is
220 when 0 =>
221 if for 1 lc g = ’1’ then
222 t 0 <= t 0 + 1;
223 start for 1 int <= ’1’;
224 end if ;
225 when 1 =>
226 if for 1 lc g = ’1’ then
227 t 0 <= 0;
228 end if ;
229 when others => null;
230 end case; −− t 0
231 end if ; −−running or start
232 end if ; −−else reset
233 end if ; −−clk
234 end process;
Example of code generated by CLooGVHDL 205
235 for 1 sv <=
236 0 when (t 0 l0=0) else
237 2 when (t 0 l0=1) else
238 0;
239 for 1 ev <=
240 1 when (t 0 l0=0) else
241 rows−1when (t 0 l0=1) else
242 1;
243 inc 1 <=
244 1 when (t 0 l0=0) else
245 1 when (t 0 l0=1) else
246 1;
247 guard for 1 <= ’0’ when (for 1 sv > for 1 ev) else ’1’;
248 start for 1 g <= ’1’ when (( start for 1 int =’1’) or ( start seq 0
=’1’ and ((( t 0=0)) ) ) ) else ’0’;
249 start for 1 <= start for 1 g and guard for 1;
250
251 −− sequence 1
252 seq 1 lc <= ’1’ when
253 (( t 0 l1 =0) and ((( t 1=0)and ( for 2 lc g = ’1’) ) ) )or(( t 0 l1 =1)
and ((( t 1=0)and ( for 2 lc g = ’1’) ) ) )
254 else ’0’;
255
256 seq 1 running p: process(clk)
257 begin
258 if (clk=’1’ and clk’event) then
259 if ( reset = ’1’) then
260 seq 1 running <= ’0’;
261 elsif seq 1 lc =’1’ then
262 seq 1 running <= ’0’;
263 elsif start seq 1 =’1’ then
264 seq 1 running <= ’1’;
265 end if ; −−else reset
266 end if ; −−clk
267 end process;
268
269 seq 1: process(clk)
270 begin
271 if (clk=’1’ and clk’event) then
272 start for 2 int <= ’0’;
273 if ( reset = ’1’) then
274 t 1 <= 0;
275 else −−reset
276 if (seq 1 running =’1’) or ( start seq 1 =’1’) then
277 case t 0 l1 is
206 Example of code generated by CLooGVHDL
278 when 0 =>
279 case t 1 l1 is
280 when 0 =>
281 if for 2 lc g = ’1’ then
282 t 1 <= 0;
283 end if ;
284 when others => null;
285 end case; −− t 1
286 when 1 =>
287 case t 1 l1 is
288 when 0 =>
289 if for 2 lc g = ’1’ then
290 t 1 <= 0;
291 end if ;
292 when others => null;
293 end case; −− t 1
294 when others => null;
295 end case; −− t 0
296 end if ; −−running or start
297 end if ; −−else reset
298 end if ; −−clk
299 end process;
300 for 2 sv <=
301 1 when (t 0 l1=0) and ( t 1 l1=0) else
302 1 when (t 0 l1=1) and ( t 1 l1=0) else
303 1;
304 for 2 ev <=
305 cols−2when (t 0 l1=0) and (t 1 l1=0) else
306 cols−2when (t 0 l1=1) and (t 1 l1=0) else
307 cols−2;
308 inc 2 <=
309 1 when (t 0 l1=0) and ( t 1 l1=0) else
310 1 when (t 0 l1=1) and ( t 1 l1=0) else
311 1;
312 guard for 2 <= ’0’ when (for 2 sv > for 2 ev) else ’1’;
313 start for 2 g <= ’1’ when (( start for 2 int =’1’) or ( start seq 1
=’1’ and ((( t 0 l1 =0) and (t 1=0)) or (( t 0 l1 =1) and (t 1=0))
) ) ) else ’0’;
314 start for 2 <= start for 2 g and guard for 2;
315
316 −− sequence 2
317 seq 2 lc <= ’1’ when ((t 0 l2=0) and ((( t 1 l2 =0) and ((( t 2=0)and
(S1 lc g = ’1’) ) ) ) ) )or(( t 0 l2 =1) and ((( t 1 l2 =0) and ((( t 2
=1)and (S2 lc g = ’1’) ) ) ) ) )
318 else ’0’;
Example of code generated by CLooGVHDL 207
319
320 seq 2 running p: process(clk)
321 begin
322 if (clk=’1’ and clk’event) then
323 if ( reset = ’1’) then
324 seq 2 running <= ’0’;
325 elsif seq 2 lc =’1’ then
326 seq 2 running <= ’0’;
327 elsif start seq 2 =’1’ then
328 seq 2 running <= ’1’;
329 end if ; −−else reset
330 end if ; −−clk
331 end process;
332
333 seq 2: process(clk)
334 begin
335 if (clk=’1’ and clk’event) then
336 start S1 t2 int <= ’0’;
337 start S2 t2 int <= ’0’;
338 if ( reset = ’1’) then
339 t 2 <= 0;
340 else −−reset
341 if (seq 2 running =’1’) or ( start seq 2 =’1’) then
342 case t 0 l2 is
343 when 0 =>
344 case t 1 l2 is
345 when 0 =>
346 case t 2 l2 is
347 when 0 =>
348 if S1 lc g = ’1’ then
349 t 2 <= 0;
350 end if ;
351 when others => null;
352 end case; −− t 2
353 when others => null;
354 end case; −− t 1
355 when 1 =>
356 case t 1 l2 is
357 when 0 =>
358 case t 2 l2 is
359 when 0 =>
360 if S1 lc g = ’1’ then
361 t 2 <= t 2 + 1;
362 start S2 t2 int <= ’1’;
363 end if ;
208 Example of code generated by CLooGVHDL
364 when 1 =>
365 if S2 lc g = ’1’ then
366 t 2 <= 0;
367 end if ;
368 when others => null;
369 end case; −− t 2
370 when others => null;
371 end case; −− t 1
372 when others => null;
373 end case; −− t 0
374 end if ; −−running or start
375 end if ; −−else reset
376 end if ; −−clk
377 end process;
378 start S1 t2 <= ’1’ when (( start S1 t2 int =’1’) or ( start seq 2 =’1’
and ((( t 0 l2 =0) and ( t 1 l2=0) and (t 2=0)) or (( t 0 l2 =1)
and ( t 1 l2 =0) and (t 2=0)) ) ) ) else ’0’;
379 start S2 t2 <= start S2 t2 int ;
380
381 p3 <=
382 i 1 ext when (t 0 l0=0) else
383 i 1 ext when (t 0 l0=1) else
384 i 1 ext ;
385 p5 <=
386 i 2 ext when (t 0 l1=0) and ( t 1 l1=0) else
387 i 2 ext when (t 0 l1=1) and ( t 1 l1=0) else
388 i 2 ext ;
389 guard S1 <= ’1’;
390 guard S2 <= ’1’;
391 S1 arg 0 <=
392 0 when ((t 0 l2=0) and ( t 1 l2=0) and ( t 2 l2 =0)) else
393 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =0)) else
394 0;
395 S1 arg 1 <=
396 0 when ((t 0 l2=0) and ( t 1 l2=0) and ( t 2 l2 =0)) else
397 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =0)) else
398 0;
399 S1 arg 2 <=
400 p3when ((t 0 l2=0) and ( t 1 l2 =0) and ( t 2 l2=0)) else
401 p3when ((t 0 l2=1) and ( t 1 l2 =0) and ( t 2 l2=0)) else
402 p3;
403 S1 arg 3 <=
404 0 when ((t 0 l2=0) and ( t 1 l2=0) and ( t 2 l2 =0)) else
405 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =0)) else
406 0;
Example of code generated by CLooGVHDL 209
407 S1 arg 4 <=
408 p5when ((t 0 l2=0) and ( t 1 l2 =0) and ( t 2 l2=0)) else
409 p5when ((t 0 l2=1) and ( t 1 l2 =0) and ( t 2 l2=0)) else
410 p5;
411 S1 arg 5 <=
412 0 when ((t 0 l2=0) and ( t 1 l2=0) and ( t 2 l2 =0)) else
413 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =0)) else
414 0;
415 S2 arg 0 <=
416 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =1)) else
417 0;
418 S2 arg 1 <=
419 1 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =1)) else
420 1;
421 S2 arg 2 <=
422 p5when ((t 0 l2=1) and ( t 1 l2 =0) and ( t 2 l2=1)) else
423 p5;
424 S2 arg 3 <=
425 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =1)) else
426 0;
427 S2 arg 4 <=
428 p3−1when ((t 0 l2=1) and (t 1 l2=0) and ( t 2 l2 =1)) else
429 p3−1;
430 S2 arg 5 <=
431 0 when ((t 0 l2=1) and ( t 1 l2=0) and ( t 2 l2 =1)) else
432 0;
433 S1 lc g <= S1 lc when guard S1=’1’ else start S1 g;
434 S2 lc g <= S2 lc when guard S2=’1’ else start S2 g;
435
436 ready p: process(clk)
437 begin
438 if (clk=’1’ and clk’event) then
439 if ( reset = ’1’) then
440 ready<= ’1’;
441 elsif start = ’1’ and seq 0 lc = ’0’ then
442 ready<= ’0’;
443 elsif seq 0 lc = ’1’ then
444 ready<= ’1’;
445 end if ; −−else reset
446 end if ; −−clk
447 end process;
448 end architecture cloog gen;
210 Example of code generated by CLooGVHDL
Index
abstract syntax tree, 39, 70
Add Var, 96
alive, 10, 56, 104
array representation, 76
ASIC, 2
backward reuse distance, 26
bandwidth limited, 26, 149, 158
Barvinok library, 77, 197
basic methods, 115
behavioral representation, 6
Bernstein
basis, 80, 184
coefficients, 80, 81, 186
expansion, 80, 101, 108, 117,
121, 183
incremental, 102, 103, 186
polynomial, 185
block-based, see IDWT
blocking read/write, 17
bounds on the range of
polynomials, 79, 183, 186
quasi-polynomials, 94, 197
cache, 7, 24, 27
associativity, 27
hit, 24
line, 24
miss, 24
policy, 24
canonical order, 61
chamber, 77, 102, 108, 115, 116, 184
circuit level, 5
CLooG, 52, 67, 125, 134
control optimization, 52, 67,
135, 141
CLooGVHDL, 125, 128, 150, 155,
199
code generation, 49, 52, 67
combined data/parameter space,
36, 37
combined space-time, 6
commutativity, 43, 58, 69, 106, 110
computation limited, 151, 153
compute bound, 21, see computa-
tion limited
conflict miss, 27
convex rational polyhedron, 32
data/parameter space, see com-
bined data/parameter space
dependence, 41
direct, 44
false dependence, 42
indirect, 44
Read after Write, 41
true dependence, 42
Write after Read, 42
Write after Write, 42
depth, 38, 39
design flow, 4
discrete wavelet transform, 165,
see IDWT
Drop Frac, 101
duplication, 57
dynamic reconfiguration, 2
212 INDEX
Ehrhart quasi-polynomial, see qua-
si-polynomial
elementary loop transformation,
see loop transformation
Exact, 99
face, see polyhedron
factorized implementation, 126,
130
false dependence, see dependence
field programmable gate array, 3
filter, 165
linear phase, 166
FPGA, 3
fractional (notation), 76
functional block, 8
fusion, see loop transformation
Handel-C, 21, 155
high-level synthesis, 1, 6, 8, 15, 16,
19, 22, 123, 125, 160
hit, see cache
hybrid method, 115
identifier block, 126, 134
IDWT, 123, 135, 148, 167
block-based, 68, 179
lifting scheme, 169
line-based, 63, 179
row-column-wise, 61, 62
tiled vertically, 64, 65
variant, 61, 63, 178
implicit representation, 32
Impulse C, 18, 155
integer polytope/polyhedron, see
polyhedron
interchange, see loop transforma-
tion
inverse discrete wavelet trans-
form, see IDWT
iteration domain, 38
iteration vector, 38
iterative compilation, 10, 12, 121
Kahn Process network, 17
Lagrange interpolation, 82
lazy wavelet transform, 169
least recently used, see LRU
Lebesgue function, 83
lexical depth, 38
lexicographical order, 32
lifting scheme, 169
line, see cache, see polyhedron
line-based, see IDWT
live, see alive
living, see alive
locality
spatial, 8, 25
temporal, 8, 25
loop bound, 126, 134
simplification, 141
loop counter (block), 126, 134
loop transformation, 45–53
chunking, 30
commutativity, 58
elementary, 50, 58, 63, 66
fusion, 30, 48, 50, 54, 56, 57, 60,
64, 65, 69, 71
fusion-like, 30
interchange, 30, 48, 50, 60, 63,
64, 71
peeling, 46, 50, 58, 60, 65, 69
primary sequence, 12, 63, 66,
69
sequence, 54–69
shift, 48, 49, 53, 54, 56–58, 60,
61, 64, 69
skewing, 46, 48, 50, 58, 60
strip-mine, 46, 50, 50, 57, 60
tiling, 30, 48, 50, 56, 64, 65, 67,
69
INDEX 213
tiling-like, 30
unimodular, 60, 61
LRU, 24, 27, 67, 181
memory access, 26
memory bottleneck, 7
memory bound, 21, see bandwidth
limited
memory hierarchy, 7, 24, 156
memory limited, 26, see band-
width limited
memory reference, 26
memory size estimation, 103, 116
Method of Differences, 134, 144
Minkowski (or geometric) repre-
sentation, 33
miss, see cache
Mod Classes, 96
Models of Computation, 17
MRE, 84, 85, 88, 90, 191
ordering vector, 39
overlap, 65
parallelism, 22, 26, 128, 136
parameterized polyhedron, 36
parametric vertex, 37, 184
partial canonical order, 61
partial reconfiguration, 3
peeling, see loop transformation
period, 76
periodic number, 76
Poly Approx, 101
polyhedral
chamber, see chamber
counting problem, 74, 76, 101
polyhedral model, 38–53, 124
polyhedron, 32
Z-polyhedron, 37
convex, 32
face, 34
implicit representation, 32
integer, 37, 78
k-polyhedron, 32
lattice, 37
line, 33
Minkowski (or geometric) rep-
resentation, 33
parameterized, 36
rational, 32
ray, 33
vertex, 33
polynomial, 78, 79, 81
polytope, 32, see polyhedron
Z-polytope, 37
Presburger formula, 77
primary sequence, see loop trans-
formation
process network, 17
processor-memory gap, 7, 23
PSNR, 178
quality, 177
quasi-polynomial, 76, 77, 79
conversion into polynomials,
94
piecewise, 77
rational polyhedron, 32
ray, see polyhedron
RE, 82, 84–86, 91, 191
reconfigurable hardware, 1
reconfiguration
dynamic, 2
partial, 3
static, 2
reconfiguration level, 6
reference distance, 118, 181
register transfer level, 4
relative approximation error, see
RE
RESUME, 158, 174, 178
214 INDEX
retiming, 144, 160
reuse
non-loop-carried, 30
pair, 26, 28
reuse distance, 26, 181
histogram, 28
Rivlin’s method, 80, 93
row-column-wise, see IDWT
RTL, 4
scalable video, 174
scattering function, 41
schedule, 41, 136
1-dimensional, 70, 103, 110
function, 41
vector, 40
scratch pad memory, 24
sequence of loop transformations,
see loop transformation
shift, see loop transformation
single assignment, 42
skewing, see loop transformation
SLO, 31, 62, 181
smart buffer, 20
SNR, 178
SOPC, 6
space-time, 6
spatial locality, 8, 25
Split Periods, 97
stack distance, 181
statement, 38
(lexical) depth, 38
duplication, 57
invocation, 38
iteration domain, 38
static analysis, 74, 118
static control part, 38
static reconfiguration, 2
Steps2Process, 136, 150
strip-mine, see loop transforma-
tion
structural representation, 6
SUIF, 20
System Builder, 6
system level, 5
temporal locality, 8, 25
transitive closure, 44
true dependence, see dependence
true reuse distance, 181
unimodular, see loop transforma-
tion
URUK, 49, 53, 66
variant, see IDWT
vector
ordering vector, 39
schedule vector, 40
vertex, 33, see polyhedron
parametric, 37, 184
Vim, 136, 150
WRaP-IT, 49, 52
Y-chart, 5
List of publications
Journal papers
[66] Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campenhout,
Erik H. D‘Hollander and Dirk Stroobandt. Finding and Applying Loop
Transformations for Generating Optimized FPGA Implementations.
Transactions on High Performance Embedded Architectures and Compilers I,
Lecture Notes in Computer Science, Vol. 4050, 2007, pp. 159–178.
[80] Hendrik Eeckhaut, Harald Devos, Peter Lambert, Davy De Schrijver,
Wim Van Lancker, Vincent Nollet, Prabhat Avasare, Tom Clerckx, Fabio
Verdicchio, Mark Christiaens, Peter Schelkens, Rik Van de Walle, Dirk
Stroobandt. Scalable, Wavelet-Based Video: from Server to Hardware-
Accelerated Client. IEEE Transactions on Multimedia, Vol. 9 (7), 2007, pp.
1508-1519
[82] Hendrik Eeckhaut, Dirk Stroobandt, Harald Devos and Mark Christi-
aens. Improving the Hardware Friendliness of a Wavelet Based Scal-
able Video Codec. WSEAS Transactions on Systems, Vol. 4 (5), 2005, pp.
625–634. Available from http://escher.elis.ugent.be/publ/
Edocs/DOC/P105_073.pdf.
[152] Dirk Stroobandt, Hendrik Eeckhaut, Harald Devos, Mark Christiaens,
Fabio Verdicchio and Peter Schelkens. Reconfigurable Hardware for
a Scalable Wavelet Video Decoder and Its Performance Requirements.
Computer Systems: Architectures, Modeling, and Simulation, Lecture Notes
in Computer Science, Vol. 3133, 2004, pp. 203–212.
[68] HaraldDevos, Kristof Beyls, Mark Christiaens, Jan Van Campenhout and
Dirk Stroobandt. Systematic Design Space Exploration of the Discrete
Wavelet Transform. Submitted to Elsevier Digital Signal Processing. In
review since December 2006.
[72] Harald Devos, Sven Verdoolaege, Jan Van Campenhout, and Dirk Stroo-
bandt. Bounds on polynomials and quasi-polynomials over discrete do-
mains. Submitted to Computing. Preparing revised version.
216 List of publications
Conference papers with international reviewers
[71] Harald Devos, Jan Van Campenhout, and Dirk Stroobandt. Building
an Application-specific Memory Hierarchy on FPGA. In 2nd HiPEAC
Workshop on Reconfigurable Computing, January 2008, Go¨teborg, Sweden,
pp. 53–62. Available from http://escher.elis.ugent.be/publ/
Edocs/DOC/P108_009.pdf.
• Fabian Diet, Erik H. D‘Hollander, Kristof Beyls and Harald Devos. Em-
bedding Smart Buffers for Window Operations in a Stream-Oriented
C-to-VHDL Compiler. In 4th IEEE International Symposium on Electronic
Design, Test & Applications (DELTA’08), January 2008, Hong Kong.
[81] Hendrik Eeckhaut, Harald Devos and Dirk Stroobandt. The Energy Scal-
ability of Wavelet-based, Scalable Video Decoding. In 17th International
Workshop on Power And Timing Modeling, Optimization and Simulation,
September 2007, Go¨teborg, Sweden. Lecture Notes in Computer Science,
Vol. 4644, pp. 363–372.
[79] Hendrik Eeckhaut, Harald Devos, Philippe Faes, Mark Christiaens, Dirk
Stroobandt. FPGA design Methodology for a Wavelet-Based Scalable
Video Decoder. In Embedded Computer Systems: Architectures, Modeling,
and Simulation, 7th International Workshop, SAMOS 2007, Greece, Lecture
Notes in Computer Science, Vol. 4599, pp. 169–178.
[70] Harald Devos, Hendrik Eeckhaut, Mark Christiaens, Dirk Stroobandt.
Energy Scalability and the RESUME Scalable Video Codec. In Power-
aware Computing Systems, Dagstuhl Seminar Proceedings 07041, January
2007, Schloss Dagstuhl, Germany. Published on-line: http://drops.
dagstuhl.de/opus/volltexte/2007/1112/.
[67] Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campenhout
and Dirk Stroobandt. From Loop Transformation to Hardware Gen-
eration. In Proceedings of the 17th ProRISC Workshop, November 2006,
Veldhoven, The Netherlands, pp. 249–255. Available from http://
escher.elis.ugent.be/publ/Edocs/DOC/P106_224.pdf.
[77] Hendrik Eeckhaut, Mark Christiaens, Harald Devos, Dirk Stroobandt.
Implementing a Hardware-Friendly Wavelet Entropy Codec for Scal-
able Video. In Proceedings of SPIE: Wavelet Applications in Industrial Pro-
cessing III, Vol. 6001, October 2005, pp. 169–179.
• Hendrik Eeckhaut, Dirk Stroobandt, Harald Devos and Mark Christiaens.
Modeling Subbands of a Wavelet Based Scalable Video Codec. In Pro-
ceedings of the 9th WSEAS International CSCC Multiconference: Circuits
‘05, Systems ‘05, Computers ‘05, Communications ‘05, July 2005. Available
from http://escher.elis.ugent.be/publ/Edocs/DOC/P105_
074.pdf.
List of publications 217
• Hendrik Eeckhaut, Mark Christiaens, Benjamin Schrauwen, Harald Devos,
Dirk Stroobandt. Improving Parallelism of a Hardware Friendly, Scal-
able Wavelet Entropy codec. In Proceedings of SPS-DARTS 2005 (CD-
ROM), April 2005, pp. 159–162. Available from http://escher.
elis.ugent.be/publ/Edocs/DOC/P105_024.pdf.
• Hendrik Eeckhaut, Harald Devos, Benjamin Schrauwen, Mark Christiaens,
Dirk Stroobandt. A Hardware-Friendly Wavelet Entropy Codec for
Scalable Video. In Proceedings of Design, Automation and Test in Europe
(DATE05), March 2005, Munich, Germany, pp. 14–19.
• Tom Clerckx, Fabio Verdicchio, Adrian Munteanu, Yiannis Andreopoulos,
Harald Devos, Hendrik Eeckhaut, Mark Christiaens, Dirk Stroobandt,
Diederik Verkest and Peter Schelkens. Complexity Scalable Motion
Compensated Temporal Filtering. In Proceedings of SPIE, Vol. 5558,
November 2004, pp. 116–127.
[69] Harald Devos, Hendrik Eeckhaut, Benjamin Schrauwen, Mark Chris-
tiaens, Dirk Stroobandt. Ever considered SystemC? In Proceedings of
the 15th ProRISC Workshop, November 2004, Veldhoven, The Nether-
lands, pp. 358–363. Available from http://escher.elis.ugent.
be/publ/Edocs/DOC/P104_120.pdf.
• Hendrik Eeckhaut, Mark Christiaens, Harald Devos, Benjamin Schrauwen,
and Dirk Stroobandt, Scalable and Hardware-friendlyWavelet Entropy
Coding. In Proceedings of the 15th ProRISC Workshop, November 2004,
Veldhoven, The Netherlands, pp. 369–376. Available from http://
escher.elis.ugent.be/publ/Edocs/DOC/P104_117.pdf.
• Harald Devos, Hendrik Eeckhaut, Mark Christiaens, Fabio Verdicchio, Dirk
Stroobandt and Peter Schelkens. Performance Requirements for Recon-
figurable Hardware for a Scalable Wavelet Video Decoder. In Proceed-
ings of ProRISC, November 2003, pp. 56–63. Available from http:
//escher.elis.ugent.be/publ/Edocs/DOC/P103_123.pdf.
• Jeroen De Maeyer, Harald Devos, Wim Meeus, Peter Verplaetse and Dirk
Stroobandt. Hardware Implementation of an EAN-13 Bar Code De-
coder. In Proceedings of the Asia and South Pacific Design Automation
Conference (ASP-DAC). January 2003, Kitakyushu, Japan, pp. 583–584.
Available from http://escher.elis.ugent.be/publ/Edocs/
DOC/P103_007.pdf.
• Harald Devos, Joni Dambre, Wim Meeus, Dirk Stroobandt and Jan Van
Campenhout. An exploration of synchronization solutions for parallel
short-range optical interconnect in mesochronous systems. In Proceed-
ings of SPIE: VCSELs and Optical Interconnects, Vol. 4942, October 2002,
Brugge, pp. 258–268. Available from http://escher.elis.ugent.
be/publ/Edocs/DOC/P102_164.pdf.
218 List of publications
Other publications
• Tom Degryse, Harald Devos and Dirk Stroobandt. Controller Area
Estimation for Automatic Design Space Exploration. Proceedings of the
seventh ACES Symposium, Edegem, September 2007, pp. 31–34. Avail-
able from http://escher.elis.ugent.be/publ/Edocs/DOC/
P107_170.pdf.
• Harald Devos, Jan Van Campenhout and Dirk Stroobandt. Finding
Bounds on Ehrhart Quasi-Polynomials. Proceedings of the seventh ACES
Symposium, Edegem, September 2007, pp. 35–38. Available from http:
//escher.elis.ugent.be/publ/Edocs/DOC/P107_168.pdf.
• Hendrik Eeckhaut, Mark Christiaens, Harald Devos, Philippe Faes,
Dirk Stroobandt. RESUME’s wavelet-based scalable video decoder.
DATE 2007 University Booth, April 2007, Nice, France, pp. S42. Avail-
able from http://escher.elis.ugent.be/publ/Edocs/DOC/
P107_043.pdf.
• Hendrik Eeckhaut, Mark Christiaens, Harald Devos, Philippe Faes,
Dirk Stroobandt. UGent delivers first real-time wavelet-based scalable
video decoder on FPGA. DSP valley newsletter, February 2007, pp. 5–
6. Available from http://escher.elis.ugent.be/publ/Edocs/
DOC/P107_018.pdf.
• Chris Bleakley, Tom Clerckx, Harald Devos, Matthias Grumer, Alex
Janek, Ulrich Kremer, Christian W. Probst, Phillip Stanley-Marbell,
Christian Steger, Vasanth Venkatachalam and Manuel Wendt. 07041
Working Group – Towards Interfaces for Integrated Performance and
Power Analysis and Simulation. In Power-aware Computing Systems,
Dagstuhl Seminar Proceedings 07041, January 2007, Schloss Dagstuhl,
Germany. Published on-line: http://drops.dagstuhl.de/opus/
volltexte/2007/1107/.
• Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campen-
hout and Dirk Stroobandt. Hardware Generation from the Polyhedral
Model. Proceedings of the sixth ACES Symposium, Edegem, October 2006,
pp. 23–26. Available from http://escher.elis.ugent.be/publ/
Edocs/DOC/P106_165.pdf.
• Harald Devos, Dirk Stroobandt and Jan Van Campenhout. Loop trans-
formations for generating scalable hardware. In Design, Automation
and Test in Europe; Fourth EDAA Ph.D. Forum, Mu¨nchen, March 2006,
on CD. Available from http://escher.elis.ugent.be/publ/
Edocs/DOC/P106_081.pdf.
• Harald Devos. Loop transformations for generating scalable hardware.
In 6th FirW PhD Symposium Gent, November 2005, on CD. Available
from http://escher.elis.ugent.be/publ/Edocs/DOC/P105_
184.pdf.
List of publications 219
• Harald Devos and Hendrik Eeckhaut. Performance Requirements for
Reconfigurable Hardware for a Scalable Video Decoder. In Fourth FTW
PhD Symposium Gent, December 2003, on CD. Available from http:
//escher.elis.ugent.be/publ/Edocs/DOC/P103_139.pdf.
220 List of publications
Bibliography
[1] Altera website. http://www.altera.com/.
[2] Bluespec. http://www.bluespec.com/.
[3] Celoxica. http://www.celoxica.com/.
[4] Impulse C. http://www.impulsec.com/.
[5] Open Research Compiler, Open64. http://ipf-orc.
sourceforge.net/ , http://www.open64.net/.
[6] SUIF Compiler System. http://suif.stanford.edu/.
[7] Vim, Vi IMproved. http://www.vim.org/.
[8] Xilinx website. http://www.xilinx.com/.
[9] The RESUME project: Reconfigurable Embedded Systems for
Use in Scalable Multimedia Environments. http://www.elis.
UGent.be/resume/, 2003–2006.
[10] Nios II C2H Compiler User Guide (ver 1.1), March 2007.
[11] Marleen Ade´. Data memory minimization for synchronous data flow
graphs emulated on DSP-FPGA targets. PhD thesis, Katholieke Uni-
versiteit Leuven, October 1996.
[12] Altera. Application note 420: Optimizing Nios II C2H compiler
results (ver 1.0). http://www.altera.com/literature/
an/AN420.pdf, July 2006.
[13] Altera. Stratix Device Handbook, January 2006.
222 BIBLIOGRAPHY
[14] Altera. White paper: Automated generation of hardware accel-
erators with direct memory access from ANSI/ISO standard C
functions (ver 1.0). http://www.altera.com/literature/
wp/wp-aghrdwr.pdf,May 2006.
[15] Altera. System interconnect fabric for memory-mapped interfaces,
Quartus II Handbook, Volume 4, Chapter 3, March 2007.
[16] Yiannis Andreopoulos, Peter Schelkens, Gauthier Lafruit, Kon-
stantinos Masselos, and Jan Cornelis. High-level cache modeling
for 2-D discrete wavelet transform implementations. Journal of
VLSI Signal Processing Systems for Signal Image and Video Technol-
ogy, 34(3):209–226, 2003.
[17] ARM. AMBA specification (Advanced Microcontroller Bus Ar-
chitecture). http://www.arm.com/.
[18] Arvind, Rishiyur S. Nikhil, Daniel L. Rosenband, and Nirav
Dave. High-level synthesis: an essential ingredient for designing
complex ASICs. In IEEE/ACM International Conference on Com-
puter Aided Design, ICCAD-2004, pages 775–782, November 2004.
[19] Pranav Ashar, Srinivas Devadas, and Arthur Richard Newton.
Sequential logic synthesis. Kluwer Academic Publishers, 1992.
[20] Christos A. Athanasiadis. Ehrhart polynomials, simplicial poly-
topes, magic squares and a conjecture of Stanley. Journal fu¨r die
reine und angewandte Mathematik (Crelles Journal), 2005(583):163–
174, June 2005.
[21] Ivan Auge´, Fre´de´ric Pe´trot, Franc¸ois Donnet, and Pascal Gomez.
Platform-based design from parallel C specifications. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems,
24(12):1811–1826, December 2005.
[22] Florin Balasa, Hongwei Zhu, and Ilie I. Luican. Computation of
storage requirements formulti-dimensional signal processing ap-
plications. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 15(4):447–460, April 2007.
[23] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan,
and Peter Marwedel. Scratchpad memory: design alternative for
BIBLIOGRAPHY 223
cache on-chip memory in embedded systems. In CODES ’02: Pro-
ceedings of the tenth international symposium on Hardware/software
codesign, pages 73–78, New York, NY, USA, 2002. ACM Press.
[24] Alexander Barvinok and James E. Pommersheim. New Perspec-
tives in Algebraic Combinatorics, volume 38 of MSRI Publications,
chapter An Algorithmic Theory of Lattice Points in Polyhedra,
pages 91–147. Cambridge University Press, 1999.
[25] Alexander I. Barvinok. A polynomial time algorithm for counting
integral points in polyhedra when the dimension is fixed. Math-
ematics of Operations Research, 19(4):769–779, November 1994.
[26] Ce´dric Bastoul. Efficient code generation for automatic paral-
lelization and optimization. In ISPDC IEEE International Sympo-
sium on Parallel and Distributed Computing, pages 23–30, Ljubljana,
October 2003.
[27] Ce´dric Bastoul. Code generation in the polyhedral model is eas-
ier than you think. In PACT’13 IEEE International Conference on
Parallel Architecture and Compilation Techniques, pages 7–16, Juan-
les-Pins, September 2004.
[28] Ce´dric Bastoul. CLooG: A Loop Generator for Scanning Polyhedra,
User’s Guide, Edition 2.0 for CLooG 0.14.0, November 2005.
[29] Ce´dric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma,
and Olivier Temam. Putting polyhedral loop transformations to
work. In LCPC’16 International Workshop on Languages and Com-
pilers for Parallel Computing, LNCS 2958, pages 209–225, College
Station, October 2003.
[30] Ce´dric Bastoul and Paul Feautrier. Improving data locality by
chunking. In Compiler Construction, volume 2622 of Lecture Notes
in Computer Science, pages 320–335, 2003.
[31] Ce´dric. Bastoul and Paul Feautrier. More legal transformations
for locality. In EURO-PAR Parallel Processing, Lecture Notes in
Computer Science, volume 3149, pages 272–283. Springer-Verlag
Berlin, 2004.
[32] Marcus Bednara, Frank Hannig, and Ju¨rgen Teich. Generation
of Distributed Loop Control. In Ed F. Deprettere, Ju¨rgen Teich,
224 BIBLIOGRAPHY
and Stamatis Vassiliadis, editors, Embedded Processor Design Chal-
lenges: Systems, Architectures, Modeling, and Simulation – SAMOS,
volume 2268 of Lecture Notes in Computer Science (LNCS), pages
154–170, Heidelberg, Germany, 2002. Springer.
[33] Marcus Bednara and Ju¨rgen Teich. Automatic synthesis of FPGA
processor arrays from loop algorithms. Journal of Supercomputing,
26(2):149–165, September 2003.
[34] L. A. Belady. A study of replacement algorithms for a virtualstor-
age computer. IBM Systems Journal, 5(2):78–101, 1966.
[35] Luca Benini, Polly Siegel, and Giovanni De Micheli. Saving
power by synthesizing gated clocks for sequential circuits. IEEE
Design & Test of Computers, 11(4):32–41, 1994.
[36] Domingo Benitez. Performance of reconfigurable architectures
for image-processing applications. Journal of Systems Architecture,
49(4-6):193–210, September 2003.
[37] K. Benkrid, D. Crookes, and A. Benkrid. Towards a general
framework for FPGA based image processing using hardware
skeletons. Parallel Computing, 28(7-8):1141–1154, August 2002.
[38] Erik Berg and Erik Hagersten. Fast data-locality profiling of na-
tive execution. In SIGMETRICS, pages 169–180, 2005.
[39] Kristof Beyls. SLO – Suggestions for Locality Optimizations.
http://slo.sourceforge.net/.
[40] Kristof Beyls. Software Methods to Improve Data Locality and Cache
Behavior. PhD thesis, Ghent University, June 2004.
[41] Kristof Beyls and Erik D’Hollander. Discovery of locality-
improving refactorings by reuse path analysis. InHPCC, volume
4208 of Lecture Notes in Computer Science, pages 220–229, 2006.
[42] Kristof Beyls and Erik H. D’Hollander. Generating cache hints
for improved program efficiency. Journal of Systems Architecture,
51(4):223–250, 2005.
[43] Kristof Beyls and Erik H. D’Hollander. Intermediately executed
code is the key to find refactorings that improve temporal data
locality. In CF ’06: Proceedings of the 3rd conference on Computing
BIBLIOGRAPHY 225
Frontiers, pages 373–382, New York, NY, USA, May 2006. ACM
Press.
[44] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peper-
straete. Cyclo-static dataflow. IEEE Transactions on Signal Pro-
cessing, 44(2):397–408, February 1996.
[45] Inc. Bluespec. Groundbreaking technology fromMIT fundamen-
tally alters approach to ASIC/FPGA creation by attacking root of
design issues. Press Release, March 2004. – Term Rewriting Sys-
tems Technology Licensed to EDA Tools Developer Bluespec as
Core of Flagship Product –.
[46] W. Bohm, J. Hammes, B. Draper, M. Chawathe, C. Ross, R. Rinker,
and W. Najjar. Mapping a single assignment programming
language to reconfigurable systems. Journal of Supercomputing,
21(2):117–130, February 2002.
[47] J. Bormans, K. Denolf, S. Wuytack, L. Nachtergaele, and
I. Bolsens. Integrating system-level low power methodologies
into a real-life design flow. In Ninth International Workshop
Power and Timing Modeling, Optimization and Simulation (PAT-
MOS), pages 19–28, Kos, Greece, October 1999.
[48] Andrew Bruce, David Donoho, and Hong-Ye Gao. Wavelet anal-
ysis [for signal processing]. IEEE Spectrum, 33(10):26–35, October
1996.
[49] Betul Buyukkurt, Zhi Guo, and Walid Najjar. Impact of loop un-
rolling on area, throughput and clock frequency in ROCCC: C
to VHDL compiler for FPGA. In Koen Bertels, Joao M.P. Car-
doso, and Stamatis Vassiliadis, editors, Reconfigurable computing:
architectures and applications: second international workshop, ARC
2006, Delft, The Netherlands: revised selected papers, volume 3985 of
LNCS, pages 401–412, March 2006.
[50] Timothy J. Callahan, John R. Hauser, and John Wawrzynek. The
Garp architecture and C compiler. IEEE Computer, 33(4):62–69,
April 2000.
[51] Patrick Carribault and Albert Cohen. Applications of storage
mapping optimization to register promotion. In ICS ’04: Proceed-
ings of the 18th annual International Conference on Supercomputing,
pages 247–256. ACM Press, 2004.
226 BIBLIOGRAPHY
[52] Francky Catthoor, Sven Wuytack, Eddy De Greef, Florin Balasa,
Lode Nachtergaele, and Arnout Vandecappelle. Custom Memory
Management Methodology: Exploration of Memory Organisation for
Embedded Multimedia System Design. Kluwer Academic Publish-
ers, Boston, USA, 1998.
[53] Chaitali Chakrabarti and Clint Mumford. Efficient realizations of
encoders and decoders based on the 2-D discrete wavelet trans-
form. IEEE Transactions on Very Large Scale Integration (VLSI) Sys-
tems, 7(3):289–298, September 1999.
[54] Christos Chrysafis and Antonio Ortega. Line-based, reduced
memory, wavelet image compression. IEEE Transactions on Im-
age Processing, 9(3):378–389, March 2000.
[55] P. Clauss and V. Loechner. Parametric analysis of polyhedral it-
eration spaces. Journal of VLSI Signal processing systems for signal
image and video technology, 19(2):179–194, July 1998.
[56] Philippe Clauss, Javier Federico Fernandez, Diego Garbervetsky,
and Sven Verdoolaege. Symbolic polynomial maximization over
convex sets and its application to memory requirement estima-
tion. Technical Report 06-04, Universite´ Louis Pasteur, October
2006. Available from http://icps.u-strasbg.fr/upload/
icps-2006-173.pdf.
[57] Albert Cohen, Sylvain Girbal, David Parello, Marc Sigler, Olivier
Temam, and Nicolas Vasilache. Facilitating the search for compo-
sitions of program transformations. In ACM Int. Conf. on Super-
computing (ICS’05), Boston, Massachusetts, June 2005.
[58] Katherine Compton and Scott Hauck. Reconfigurable comput-
ing: a survey of systems and software. ACM Computing Surveys,
34(2):171–210, 2002.
[59] Katherine Compton, Zhiyuan Li, James Cooley, Stephen Knol,
and Scott Hauck. Configuration relocation and defragmentation
for run-time reconfigurable computing. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 10(3):209–220, June 2002.
[60] Ingrid Daubechies and Wim Sweldens. Factoring wavelet trans-
forms into lifting steps. Journal of Fourier Analysis and Applications,
4(3):245–267, 1998.
BIBLIOGRAPHY 227
[61] Philip J. Davis. Interpolation and Approximation. Dover Publica-
tions, Inc., New York, 1975.
[62] Jesus A. De Loera, Raymond Hemmecke, Matthias Koppe, and
Robert Weismantel. Integer polynomial optimization in fixed di-
mension. Mathematics of operations research, 31(1):147–153, Febru-
ary 2006.
[63] A. DeHon, J. Adams, M. DeLorimier, N. Kapre, Y. Matsuda,
H. Naeimi, M. Vanier, and M. Wrighton. Design patterns for re-
configurable computing. In 12th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines, FCCM, pages 13–23,
April 2004.
[64] Andre´ DeHon. The density advantage of configurable comput-
ing. IEEE Computer, 33(4):41–49, April 2000.
[65] Steven Derrien and Tanguy Risset. Interfacing compiled FPGA
programs: the MMAlpha approach. In Int. conf. on Parallel and
Distributed Processing Techniques and Applications, Rennes Cedex,
June 2000.
[66] Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campen-
hout, Erik H. D’Hollander, and Dirk Stroobandt. Finding and
applying loop transformations for generating optimized FPGA
implementations. Transactions on High Performance Embedded Ar-
chitectures and Compilers I, LNCS, 4050:159–178, 2007.
[67] Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campen-
hout, and Dirk Stroobandt. From loop transformation to hard-
ware generation. In Proceedings of the 17th ProRISC Workshop,
pages 249–255, Veldhoven, November 2006.
[68] Harald Devos, Kristof Beyls, Mark Christiaens, Jan Van Campen-
hout, and Dirk Stroobandt. Systematic design space exploration
of the discrete wavelet transform. Sumbitted to Elsevier Digital Sig-
nal Processing, 2006.
[69] Harald Devos, Hendrik Eeckhaut, Benjamin Schrauwen, Mark
Christiaens, and Dirk Stroobandt. Ever considered SystemC? In
Proceedings of the 15th ProRISC Workshop, pages 358–363, Veld-
hoven, November 2004.
228 BIBLIOGRAPHY
[70] Harald Devos, Eeckhaut Hendrik, Mark Christiaens, and
Dirk Stroobandt. Energy scalability and the RESUME scal-
able video codec. In Luca Benini, Naehyuck Chang, Ul-
rich Kremer, and Christian W. Probst, editors, Power-aware
Computing Systems, number 07041 in Dagstuhl Seminar Pro-
ceedings, Dagstuhl, Germany, January 2007. Internationales
Begegnungs- und Forschungszentrum fu¨r Informatik (IBFI),
Schloss Dagstuhl, Germany. http://drops.dagstuhl.de/
opus/volltexte/2007/1112.
[71] Harald Devos, Dirk Stroobandt, and Jan Van Campenhout.
Building an application-specific memory hierarchy on FPGA. In
2nd HiPEAC Workshop on Reconfigurable Computing, pages 53–62,
Go¨teborg, Sweden, January 2008.
[72] Harald Devos, SvenVerdoolaege, Jan VanCampenhout, andDirk
Stroobandt. Bounds on polynomials and quasi-polynomials over
discrete domains. Computing. Preparing revised version.
[73] Pedro Diniz, Mary Hall, Joonseok Park, Byoungro So, and Heidi
Ziegler. Automatic mapping of C to FPGAs with the DEFACTO
compilation and synthesis system. Microprocessors and Microsys-
tems, 29(2-3):51–62, April 2005.
[74] Florian Dittmann, AchimRettberg, and Fabian Schulte. A Y-chart
based tool for reconfigurable system design. In Workshop on Dy-
namically Reconfigurable Systems (DRS), Innsbruck (Austria), pages
67–73. VDE Verlag, March 2005.
[75] Yazhuo Dong, Yong Dou, and Jie Zhou. Optimized generation of
memory structure in compiling window operations onto recon-
figurable hardware. In International Workshop on Applied Reconfig-
urable Computing (ARC), volume 4419 of Lecture Notes in Computer
Science, pages 110–121, 2007.
[76] Franc¸ois Donnet. Synthe`se de Haut Niveau Controˆle´e par
l’Utilisateur. PhD thesis, Universite´ Pierre et Marie Curie, Paris,
France, January 2004.
[77] Hendrik Eeckhaut, Mark Christiaens, Harald Devos, and Dirk
Stroobandt. Implementing a hardware-friendly wavelet entropy
codec for scalable video. In F. Truchetet and O. Laligant, editors,
Proceedings of SPIE: Wavelet Applications in Industrial Processing III,
BIBLIOGRAPHY 229
volume 6001, pages 169–179, Boston, Massachusetts, USA, Octo-
ber 2005. SPIE.
[78] Hendrik Eeckhaut, Mark Christiaens, and Dirk Stroobandt. Im-
proving external memory access for Avalon systems on pro-
grammable chips. In FPL’07, 17th International Conference on Field
Programmable Logic and Applications, August 2007.
[79] Hendrik Eeckhaut, Harald Devos, Philippe Faes, Mark Christi-
aens, and Dirk Stroobandt. FPGA design methodology for a
wavelet-based scalable video decoder. In Stamatis Vassiliadis,
Mladen Berekovic, and Timo Ha¨ma¨la¨inen, editors, Embedded
Computer Systems: Architectures, Modeling, and Simulation, volume
4599 of Lecture Notes in Computer Science, pages 169–178, Samos,
Greece, July 2007. Springer.
[80] Hendrik Eeckhaut, Harald Devos, Peter Lambert, Davy
De Schrijver, Wim Van Lancker, Vincent Nollet, Prabhat
Avasare, Tom Clerckx, Fabio Verdicchio, Mark Christiaens, Pe-
ter Schelkens, Rik Van de Walle, and Dirk Stroobandt. Scalable,
wavelet-based video: from server to hardware-accelerated client.
IEEE Transactions on Multimedia, 9(7):1508–1519, November 2007.
[81] Hendrik Eeckhaut, Harald Devos, and Dirk Stroobandt. The en-
ergy scalability of wavelet-based, scalable video decoding. In
17th International Workshop on Power And Timing Modeling, Opti-
mization and Simulation (PATMOS), volume 4644 of Lecture Notes
in Computer Science, pages 363–372, Go¨teborg, Sweden, Septem-
ber 2007. Springer-Verlag.
[82] Hendrik Eeckhaut, Dirk Stroobandt, Harald Devos, and Mark
Christiaens. Improving the hardware friendliness of a wavelet
based scalable video codec. WSEAS Transactions on Systems,
4(5):625–634, May 2005.
[83] Paul Feautrier. Array expansion. In ICS ’88: Proceedings of the 2nd
international conference on Supercomputing, pages 429–441, New
York, NY, USA, 1988. ACM Press.
[84] Paul Feautrier. Some efficient solutions to the affine scheduling
problem. Part I. One-dimensional time. International Journal of
Parallel Programming, 21(5):313–348, October 1992.
230 BIBLIOGRAPHY
[85] Paul Feautrier. Some efficient solutions to the affine scheduling
problem. Part II. Multidimensional time. International Journal of
Parallel Programming, 21(6):389–420, December 1992.
[86] D. D. Gajski and L. Ramachandran. Introduction to high-level
synthesis. Design & Test of Computers, IEEE, 11(4):44–54, Winter
1994.
[87] Daniel D. Gajski and Robert H. Kuhn. NewVLSI tools. Computer,
16(12):11–14, December 1983.
[88] Ju¨rgen Garloff. The Bernstein expansion and its applica-
tions. Tutorial paper. Available from http://www-home.
fh-konstanz.de/˜garloff/BeExAppl.doc.
[89] Ju¨rgen Garloff. Convergent bounds for the range of multivariate
polynomials. In Proceedings of the International Symposium Inter-
val Mathematics, volume 212 of Lecture Notes in Computer Science,
pages 37–56, September 1985.
[90] Sylvain Girbal. Optimisations d’applications - Composition de trans-
formations de programme: mode`le et utils. PhD thesis, Universite´ de
Paris-Sud, 2005.
[91] Sylvain Girbal, Nicolas Vasilache, Ce´dric Bastoul, Albert Cohen,
David Parello, Marc Sigler, and Olivier Temam. Semi-automatic
composition of loop transformations for deep parallelism and
memory hierarchies. Int. J. Parallel Program., 34(3):261–317, 2006.
[92] S. Gopalsamy, Dilip Khandekar, and S. P. Mudur. A new method
of evaluating compact geometric bounds for use in subdivision
algorithms. Computer Aided Geometric Design, 8(5):337–356, 1991.
[93] Anne-Claire Guillou, Patrice Quinton, and Tanguy Risset. Hard-
ware synthesis for systems of recurrence equations with multi-
dimensional schedule. International Journal of Embedded Systems,
2005. To appear.
[94] Anne-Claire Guillou, Patrice Quinton, Tanguy Risset, and Daniel
Massicotte. Automatic design of VLSI pipelined LMS architec-
tures. In International Conference on Parallel Computing in Electrical
Engineering, PARELEC 2000, pages 144–149, Aug 2000.
BIBLIOGRAPHY 231
[95] Zhi Guo, Betul Buyukkurt, and Walid Najjar. Input data reuse in
compiling window operations onto reconfigurable hardware. In
LCTES ’04: Proceedings of the 2004 ACM SIGPLAN/SIGBED confer-
ence on Languages, compilers, and tools, pages 249–256. ACM Press,
July 2004.
[96] Zhi Guo, Betul Buyukkurt, Walid Najjar, and Kees Vissers. Opti-
mized generation of data-path from C codes for FPGAs. InDATE
’05: Proceedings of the conference on Design, Automation and Test in
Europe, pages 112–117, Washington, DC, USA, 2005. IEEE Com-
puter Society.
[97] Zhi Guo, Walid Najjar, Frank Vahid, and Kees Vissers. A quanti-
tative analysis of the speedup factors of FPGAs over processors.
In FPGA ’04: Proceedings of the 2004 ACM/SIGDA 12th Interna-
tional Symposium on Field Programmable Gate Arrays, pages 162–
170, New York, NY, USA, 2004. ACM Press.
[98] Sumit Gupta, Rajesh Gupta, Nikil Dutt, and Alexandru Nicolau.
SPARK, A Parallelizing Approach to the High-Level Synthesis of Dig-
ital Circuits. Kluwer Academic Publishers, 2004.
[99] Sumit Gupta, Nick Savoiu, Nikil Dutt, Rajesh Gupta, and Alex
Nicolau. Using global code motions to improve the quality of
results for high-level synthesis. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems,, 23(2):302–312,
February 2004.
[100] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R. Murphy, Shih-
Wei Liao, and E. Bu. Maximizing multiprocessor performance
with the SUIF compiler. Computer, 29(12):84–89, December 1996.
[101] Frank Hannig, Hritam Dutta, and Ju¨rgen Teich. Mapping a Class
of DependenceAlgorithms to Coarse-grained Reconfigurable Ar-
rays: Architectural Parameters and Methodology. International
Journal of Embedded Systems, 2(1/2):114–127, 2006.
[102] Frank Hannig and Ju¨rgen Teich. Design Space Exploration for
Massively Parallel Processor Arrays. In Victor Malyshkin, editor,
Proceedings of the Parallel Computing Technologies, 6th International
Conference (PaCT), volume 2127 of Lecture Notes in Computer Sci-
ence, pages 51–65, Novosibirsk, Russia, September 2001. Springer.
232 BIBLIOGRAPHY
[103] Reiner W. Hartenstein, Michael Herz, Thomas Hoffmann, and
Ulrich Nageldinger. Mapping applications onto reconfigurable
Kress Arrays. In FPL ’99: Proceedings of the 9th International Work-
shop on Field-Programmable Logic and Applications, pages 385–390,
London, UK, 1999. Springer-Verlag.
[104] James C. Hoe and Arvind. Hardware synthesis from term rewrit-
ing systems. In Proceedings of VLSI, pages 595–619, Lisbon, Por-
tugal, December 1999.
[105] James C. Hoe and Arvind. Operation-centric hardware descrip-
tion and synthesis. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 23(9):1277–1288, September 2004.
[106] Chao-Tsung Huang, Po-Chih Tseng, and Liang-Gee Chen. Anal-
ysis and VLSI architecture for 1-D and 2-D discrete wavelet trans-
form. IEEE Transactions on Signal Processing, 53(4):1575–1586,
April 2005.
[107] Ralf Hungerbu¨hler and Ju¨rgen Garloff. Bounds for the range of a
bivariate polynomial over a triangle. Reliable Computing, 4(1):3–
13, 1998.
[108] Christopher Hylands, Edward Lee, Jie Liu, Xiaojun Liu, Stephen
Neuendorffer, Yuhong Xiong, Yang Zhao, and Haiyang Zheng.
Overview of the Ptolemyproject. Technical report, Department of
Electrical Engineering and Computer Science, University of Cal-
ifornia, Berkeley, CA 94720, July 2003.
[109] Ilya Issenin, Erik Brockmeyer, Miguel Miranda, and Nikil Dutt.
Data Reuse Analysis Technique for Software-ControlledMemory
Hierarchies. In DATE ’04: Proceedings of the conference on Design,
Automation and Test in Europe, volume 1, pages 202–207. IEEE
Computer Society, 2004.
[110] Gilles Kahn. The semantics of a simple language for parallel pro-
gramming. In J. L. Rosenfeld, editor, Proceedings of International
Federation for Information Processing Congress 74, (IFIP), pages 471–
475, Stockholm, Sweden, Augustus 1974. North Holland, Ams-
terdam.
[111] M. Kandemir, J. Ramanujam, M.J. Irwin, N. Vijaykrishnan, I. Ka-
dayif, and A. Parikh. A compiler-based approach for dynam-
ically managing scratch-pad memories in embedded systems.
BIBLIOGRAPHY 233
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 23(2):243–260, February 2004.
[112] Ashraf A. Kassim, Pingkun Yan, Wei Siong Lee, and Kuntal Sen-
gupta. Motion compensated lossy-to-lossless compression of 4-D
medical images using integer wavelet transforms. IEEE Transac-
tions on Information Technology in Biomedicine, 9(1):132–138, March
2005.
[113] I. Kharitonenko, Xing Zhang, and S. Twelves. A wavelet
transform with point-symmetric extension at tile boundaries.
IEEE Transactions on Image Processing, 11(12):1357–1364, Decem-
ber 2002.
[114] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs
and ASICs. IEEE Transactions on computer-aided design of integrated
circuits and systems, 26(2):203–215, February 2007.
[115] Gauthier Lafruit, Francky Catthoor, Jan P.H. Cornelis, and
Hugo J. De Man. An efficient VLSI architecture for 2-D wavelet
image coding with novel image scan. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 7(1):56–68, March 1999.
[116] Gauthier Lafruit, Lode Nachtergaele, Jan Bormans, Marc Engels,
and Ivo Bolsens. Optimal memory organization for scalable tex-
ture codecs in MPEG-4. IEEE Transactions on Circuits and Systems
for Video Technology, 9(2):218–243, March 1999.
[117] Monica S. Lam and Michael E. Wolf. A data locality optimizing
algorithm. SIGPLAN Not., 39(4):442–459, April 2004.
[118] Z. Li, K. Compton, and S. Hauck. Configuration cache manage-
ment techniques for FPGAs. In IEEE Symposium on FPGAs for
Custom Computing Machines, pages 22–36, 2000.
[119] Vincent Loechner and Doran K. Wilde. Parameterized polyhedra
and their vertices. International Journal of Parallel Programming,
25(6):525–549, December 1997.
[120] F. W. Luttmann and T. J. Rivlin. Some numerical experiments
in the theory of polynomial interpolation. IBM J. Res. Develop.,
9:187–191, 1965.
234 BIBLIOGRAPHY
[121] Stephane G. Mallat. A theory for multiresolution signal decom-
position: the wavelet representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):674–693, July 1989.
[122] R. Martin, H.H. Shou, I. Voiculescu, A. Bowyer, and G.J. Wang.
Comparison of interval methods for plotting algebraic curves.
Computer Aided Geometric Design, 19(7):553–587, July 2002.
[123] M. Martonosi, A. Gupta, and T.E. Anderson. Tuning memory
performance of sequential and parallel programs. IEEE Computer,
28(4):32–40, April 1995.
[124] Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. Improv-
ing data locality with loop transformations. ACM Transactions on
Programming Languages and Systems, 18(4):424–453, July 1996.
[125] Bingfei Mei, Andy Lambrechts, Jean-Yves Mignolet, Diederik
Verkest, and Rudy Lauwereins. Architecture exploration for a
reconfigurable architecture template. IEEE Design & Test of Com-
puters, 22(2):90–101, March-April 2005.
[126] Bingfeng Mei, Serge Vernalde, Hugo De Man, and Rudy Lau-
wereins. Design and optimization of dynamically reconfigurable
embedded system. In International Conference on Engineering of Re-
configurable Systems and Algorithms, Las Vegas, Nevada USA, June
2001.
[127] Benoıˆt Meister. Stating and Manipulating Periodicity in the Polytope
Model. Applications to Program Analysis and Optimization. PhD the-
sis, Universite´ Louis Pasteur, Strasbourg, France, December 2004.
[128] Benoıˆt Meister. Approximation of polytope enumerators using
linear expansions, 2007. Manuscript in preparation.
[129] Vijay Menon, Keshav Pingali, and Nikolay Mateev. Fractal sym-
bolic analysis. ACM Transactions on Programming Languages and
Systems, 25(6):776–813, 2003.
[130] Petra Michel, Ulrich Lauther, and Peter Duzy. The Synthesis Ap-
proach to Digital System Design. Kluwer Academic Publishers,
Norwell, MA, USA, 1992.
[131] Miguel A. Miranda, Francky V.M. Catthoor, Martin Janssen, and
Hugo J. De Man. High-level address optimization and synthesis
BIBLIOGRAPHY 235
techniques for data-transfer-intensive applications. IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 6(4):677–686,
December 1998.
[132] M. J. Myjak and J. G. Delgado-Frias. A two-level reconfigurable
architecture for digital signal processing. Microelectron. Eng.,
84(2):244–252, 2007.
[133] Steve Oualline. Vi IMproved (VIM). New Riders, 2001.
[134] Emre O¨zer, Andy P. Nisbet, and David Gregg. Automatic cus-
tomization of embedded applications for enhanced performance
and reduced power using optimizing compiler techniques. In
Proceedings of the 10th European Conference on Parallel Computing
(Europar 04), volume 3149 of LNCS, pages 318–327, August 2004.
[135] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer,
C. Kulkarni, A. Vandecappelle, and P. G. Kjeldsberg. Data and
memory optimization techniques for embedded systems. ACM
Transactions on Design Automation of Electronic Systems, 6(2):149–
206, 2001.
[136] Louis-Noe¨l Pouchet, Ce´dric Bastoul, John Cavazos, and Albert
Cohen. A note on the performance distribution of affine sched-
ules. In 2ndWorkshop on Statistical and Machine learning approaches
to ARchitectures and compilaTion (SMART’08), 2008.
[137] Louis-Noe¨l Pouchet, Ce´dric Bastoul, Albert Cohen, and Nicolas
Vasilache. Iterative optimization in the polyhedral model: Part I,
one-dimensional time. In International Symposium on Code Gener-
ation and Optimization, pages 144–156, San Jose, California, March
2007. IEEE Computer Society.
[138] Louis-Noe¨l et al. Pouchet. Scalable empirical search of multidi-
mensional schedules. 2008. Accepted for PLDI’08 (Programming
Language Design and Implementation), Tucson, Arizona.
[139] H. Ratschek and J. Rokne. Computer Methods for the Range of Func-
tions. Halsted Press, Wiley, New York, 1984.
[140] Bob Ramakrishna Rau and Michael S. Schlansker. Embedded
computer architecture and automation. IEEE Computer, 34(4):75–
83, April 2001.
236 BIBLIOGRAPHY
[141] Martin C. Rinard and Pedro C. Diniz. Commutativity analy-
sis: a new analysis technique for parallelizing compilers. ACM
Transactions on Programming Languages and Systems, 19(6):942–991,
November 1997.
[142] Theodore J. Rivlin. Bounds on a polynomial. Journal of Re-
search of the National Bureau of Standards-B. Mathematical Sciences,
74B(1):47–54, 1970.
[143] Theodore J. Rivlin. An introduction to the approximation of functions.
Dover books on advanced mathematics. Dover Publications, Inc.,
New York, 1981. Corr. reprint of the 1969 orig. edition.
[144] J. Rokne. Bounds for an interval polynomial. Computing,
18(3):225–240, 1977.
[145] P. Schaumont, I. Verbauwhede, K. Keutzer, and M. Sarrafzadeh.
A quick safari through the reconfiguration jungle. In Proc. 38th
IEEE/ACM Design Automation Conference (DAC 2001), pages 172–
177, June 2001.
[146] Peter Schelkens, Adrian Munteanu, Joeri Barbarien, Mihnea
Galca, Xavier Giro-Nieto, and Jan Cornelis. Wavelet coding of
volumetric medical datasets. IEEE Transactions on Medical Imag-
ing, 22(3):441–458, March 2003.
[147] Robert Schreiber, Shail Aditya, Scott Mahlke, Vinod Kathail,
B. Ramakrishna Rau, Darren Cronquist, andMukund Sivaraman.
PICO-NPA: High-level synthesis of nonprogrammable hardware
accelerators. Journal of VLSI Signal Processing Systems for Signal
Image and Video Technology, 31(2):127–142, 2002.
[148] Alex Semenov, Albert M. Koelmans, Lee Lloyd, and Alexandre
Ykovlev. Designing an asynchronous processor using Petri nets.
IEEE Micro, 17(2):54–64, March/April 1997.
[149] Jeffrey Sheldon,Walter Lee, Ben Greenwald, and Saman Amaras-
inghe. Strength reduction of integer division and modulo opera-
tions. In Languages and Compilers for Parallel Computing: 14th In-
ternational Workshop, LCPC, volume 2624 of Lecture Notes in Com-
puter Science, pages 1–14, Cumberland Falls, Kentucky, August
2001.
BIBLIOGRAPHY 237
[150] Simon J. Smith. Lebesgue constants in polynomial interpolation.
Annales Mathematicae et Informaticae, 33:109–123, 2006.
[151] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A compiler ap-
proach to fast hardware design space exploration in FPGA-based
systems. In PLDI ’02: Proceedings of the ACM SIGPLAN 2002 Con-
ference on Programming language design and implementation, pages
165–176, New York, NY, USA, June 2002. ACM Press.
[152] Dirk Stroobandt, Hendrik Eeckhaut, Harald Devos, Mark Chris-
tiaens, Fabio Verdicchio, and Peter Schelkens. Reconfigurable
hardware for a scalable wavelet video decoder and its perfor-
mance requirements. In A.D. Pimentel and S. Vassiliadis, editors,
Computer Systems: Architectures, Modeling, and Simulation, volume
3133 of Lecture Notes in Computer Science, pages 203–212, Samos,
July 2004. Springer.
[153] Chris Sullivan, Alex Wilson, and Stephen Chappell. Using C
based logic synthesis to bridge the productivity gap. In ASP-
DAC ’04: Proceedings of the Asia and South Pacific Design Automa-
tion Conference 2004, pages 349–354, Piscataway, NJ, USA, 2004.
IEEE Press.
[154] Wim Sweldens and Peter Schro¨der. Building your own wavelets
at home. In Wavelets in Computer Graphics, pages 15–87. ACM
SIGGRAPH Course notes, 1996.
[155] Ju¨rgen Teich and Lothar Thiele. Exact partitioning of affine de-
pendence algorithms. In E. F. Deprettere, J. Teich, and S. Vas-
siliadis, editors, Embedded Processor Design Challenges, volume
2268 of Lecture Notes in Computer Science (LNCS), pages 135–153,
Springer, Berlin, March 2002.
[156] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere. Translat-
ing affine nested-loop programs to process networks. In CASES
’04: Proceedings of the 2004 international conference on Compilers, ar-
chitecture, and synthesis for embedded systems, pages 220–229, New
York, NY, USA, 2004. ACM Press.
[157] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere. Solving
out-of-order communication in Kahn process networks. Journal
of VLSI Signal Processing Systems, 40(1):7–18, 2005.
238 BIBLIOGRAPHY
[158] Tanja Van Achteren, Francky Catthoor, Rudy Lauwereins, and
Geert Deconinck. Search space definition and exploration for
nonuniform data reuse opportunities in data-dominant applica-
tions. ACM Transactions on Design Automation of Electronic Sys-
tems, 8(1):125–139, January 2003.
[159] Sven Verdoolaege. Incremental Loop Transformations and Enumera-
tion of Parametric Sets. PhD thesis, KULeuven, April 2005. Advi-
sors: M. Bruynooghe, F. Catthoor.
[160] Sven Verdoolaege. barvinok: User Guide, 0.24 edition, June
2007. Available from http://freshmeat.net/projects/
barvinok.
[161] Sven Verdoolaege, Kristof Beyls, Maurice Bruynooghe, and
Francky Catthoor. Experiences with enumeration of integer pro-
jections of parametric polytopes. In R. Bodik, editor, Compiler
Construction: 14th International Conference, volume 3443 of Lecture
Notes in Computer Science, pages 91–105, Edinburgh, March 2005.
Springer.
[162] Sven Verdoolaege, Rachid Seghir, Kristof Beyls, Vincent Loech-
ner, and Maurice Bruynooghe. Counting integer points in para-
metric polytopes using Barvinok’s rational functions. Algorith-
mica, 48(1):37–66, June 2007.
[163] Jiacun Wang. Timed Petri Nets : Theory and Application. Kluwer
Academic Publishers, 1998. ISBN: 0-7923-8270-6.
[164] Michael E. Wolf and Monica S. Lam. A data locality optimizing
algorithm. In PLDI ’91: Proceedings of the ACMSIGPLAN1991 con-
ference on Programming language design and implementation, pages
30–44, New York, NY, USA, June 1991. ACM Press.
[165] Michael Wolfe. High performance compilers for parallel computing.
Addison-Wesley, 1996.
[166] Yana Yankova, Koen Bertels, Stamatis Vassiliadis, Roel Meeuws,
and Acrilio Virginia. Automated HDL generation: Comparative
evaluation. In Proceedings of International Symposium on Circuits
and Systems (ISCAS2007), May 2007.
[167] Nikos D. Zervas, Giorgos P. Anagnostopoulos, Vassilis
Spiliotopoulos, Yiannis Andreopoulos, and Costas E. Goutis.
BIBLIOGRAPHY 239
Evaluation of design alternatives for the 2-D discrete wavelet
transform. IEEE Transactions on Circuits and Systems for Video
Technology, 11(12):1246–1262, December 2001.
[168] Hongwei Zhu. Computation of Memory Requirements for Multi-
Dimensional Signal Processing Applications. PhD thesis, University
of Illinois at Chicago, 2006. Preliminary Doctoral Thesis.
[169] Hongwei Zhu, Ilie I. Luican, and Florin Balasa. Memory size
computation for multimedia processing applications. In ASP-
DAC ’06: Proceedings of the 2006 conference on Asia South Pacific de-
sign automation, pages 802–807, New York, NY, USA, 2006. ACM
Press.
[170] Gu¨nter M. Ziegler. Lectures on Polytopes (Graduate Texts in Mathe-
matics). Springer, July 2001.
[171] Claudiu Zissulescu, Bart Kienhuis, and Ed Deprettere. Expres-
sion synthesis in process networks generated by LAURA. In 16th
IEEE International Conference on Application-specific Systems, Archi-
tectures and Processors (ASAP), pages 15–21, July 2005.
240 BIBLIOGRAPHY
