Retargetable code generation based on an architecture description language by Hohenauer, Manuel
Retargetable Code Generation based on
an Architecture Description Language
Dissertation approved by the Faculty of Electrical Engineering and Information Technology
of RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen)
for the award of the academic degree of
Doctor of Engineering (Doktor der Ingenieurwissenschaften)
submitted by
Diplom-Ingenieur
Manuel Hohenauer
from Krefeld, North Rhine-Westphalia
Reviewers:
Universitätsprofessor Dr. rer. nat. Rainer Leupers
Universitätsprofessorin Dr. rer. nat. Sabine Glesner
Date of the oral examination:
9 January 2009
This dissertation is available online on the website of the university library.
Summary (Zusammenfassung)
The steadily growing complexity and the increasing performance requirements of new applications
in wireless communication, automotive, and consumer electronics have had a lasting influence on
the design and implementation of embedded systems. The current trend is towards programmable
System-on-Chip (SoC) platforms. The flexibility of software improves design efficiency and reduces
the risk and cost of hardware redesigns. More and more such SoC designs use Application Specific
Instruction-set Processors (ASIPs) as building blocks. They offer excellent properties with respect
to computational performance, power consumption, and unit cost. Accordingly, more and more
commercial platforms for the efficient design of ASIPs are available. These platforms consist of
retargetable software development tools (C compiler, assembler, linker, simulator, etc.) that can
be quickly adapted to different processor configurations. The input to such tools is usually a
processor model written in a dedicated Architecture Description Language (ADL). More advanced
ADLs can additionally generate a synthesizable hardware model from the same description.
The greatest challenge in designing an ADL is to capture the architecture description
unambiguously and consistently for the generation of the individual tools. This is particularly
difficult for the compiler and the simulator, because both need information about the semantics
of the instructions, but from different points of view. A compiler, or more precisely its code
selector, mainly requires the information what an instruction does in order to translate C source
code into equivalent assembly instructions. The simulator, by contrast, needs the information how
an instruction is executed. In practice it is extremely difficult, if not impossible, to derive
one kind of information from the other. Existing ADL-based approaches therefore either restrict
the possible target architectures considerably or support the generation of only one of the two
software tools.
A further challenge in this context is retargetable compilation for high-level languages such as
C/C++. Compilers have by now become indispensable for achieving high software development
productivity and for mastering the steadily growing complexity of today's applications. Compared
with hand-written compilers or assembly programs, however, retargetable C compilers are often
hampered by their low code quality. Usually there is a trade-off between the flexibility of the
compiler and the quality of the generated code. To close this gap in code quality, flexible,
retargetable optimizations are needed that can be quickly adapted to changing target processor
configurations.
To solve these problems, this thesis develops and implements concepts that can automatically
generate a complete code selector description from an ADL. The approach is based on the Language
for Instruction Set Architectures (LISA) ADL and a language extension for the consistent
description of instruction semantics. As a result of this work, compilers can be retargeted fully
automatically as part of the LISA-based development process and are thus available early in
architecture development. This contributes to a more efficient design process overall. To ensure
high flexibility with respect to different target architectures, the development was driven by
three very different, real-world processors. Furthermore, two popular architecture classes that
require specific optimization techniques were selected: ASIPs with support for Single Instruction
Multiple Data (SIMD) and for Predicated Execution. This thesis implements these techniques such
that retargetability and high code quality can be achieved for the given processor class. In
addition, it describes a retargetable assembler for the efficient development of optimizations at
the machine code level (e.g. peephole optimizations).
Abstract
Over the past few years, the ever increasing complexity and performance requirements of new
wireless communications, automotive and consumer electronics applications have been changing the
way embedded systems are designed and implemented. The current trend is towards programmable
System-on-Chip platforms in order to improve the design efficiency and reduce the risk and costs
of hardware redesign cycles. An increasing number of such systems employ Application Specific
Instruction-set Processors (ASIPs) as building blocks due to their balance between computational
efficiency and flexibility. Consequently, more and more commercial platforms are available for
ASIP architecture exploration and design. These platforms comprise retargetable software
development tools (C compiler, assembler, linker, simulator etc.) that can be quickly adapted to
varying target processor configurations. Such tools are usually driven by a processor model given
in a dedicated Architecture Description Language (ADL). Advanced ADLs are even capable of
generating the system interfaces and a synthesizable hardware model from the same specification.
The most challenging task in designing an ADL, though, is to capture the architectural information
needed for the tool generation in an unambiguous and consistent way. This is particularly difficult
for the compiler and the simulator, as both essentially need information about the instruction’s
semantics, but from different points of view. The compiler, more specifically the compiler’s code
selector, needs to know what an instruction does in order to select appropriate instructions for a
given piece of source code, while the simulator needs to know how the instruction is executed. In
practice it is quite difficult, if not impossible, to derive one kind of information from the
other. None of the existing ADLs, if compiler generation is supported at all, solves this problem
in a sophisticated manner. Either redundancies are introduced or the language’s flexibility is
sacrificed.
Another challenge in this context is retargetable compilation for high-level programming
languages like C/C++. Compilers have meanwhile become a necessity in order to attain high software
development productivity and to cope with the ever growing complexity of today’s applications.
Retargetable C compilers, however, are often hampered by their limited code quality as compared
to hand-written compilers or assembly code, since there is usually a trade-off between the
compiler’s flexibility and the quality of the compiled code. Narrowing this code quality gap
requires flexible, retargetable optimization techniques for common architectural features which
can be quickly adapted to varying target processor configurations.
This thesis presents a novel technique for extracting the code selector description fully
automatically from ADL processor models. The approach is based on the Language for Instruction
Set Architectures (LISA) ADL using a language extension for instruction semantics description.
This enables the automatic generation of C compilers from a LISA processor description without
losing flexibility or introducing inconsistencies. In this way, a high speedup in compiler
generation is achieved, which contributes to a more efficient ASIP design flow. The feasibility of
the approach is demonstrated for several contemporary embedded processors. Furthermore, two
popular architectural classes are selected which demand specific code optimization techniques,
namely processors equipped with SIMD instructions and those with Predicated Execution support.
This thesis implements these specific techniques such that retargetability and high code
quality within the given processor class are obtained. Moreover, to ease the manual creation of
dedicated optimizations on the assembly level, this thesis implements a new retargetable
assembler which provides an application programmer interface for user-defined code optimizations
such as a peephole optimizer.
Acknowledgements
This thesis is the result of more than 5 years of work during which I have been accompanied
and supported by many people. It is now my great pleasure to take this opportunity to thank them.
First and foremost, I would like to thank my thesis supervisor Professor Rainer Leupers for
providing me with the opportunity to work in his group, and for his important advice and
constant encouragement throughout the course of my research. He always left me a lot of freedom
and contributed much to an enjoyable and productive working atmosphere. I am also thankful
to Professor Gerd Ascheid and Professor Heinrich Meyr. Their comments often unveiled new
interesting aspects and perspectives. I want to thank all of them for the lessons they gave me
on the importance of details for the success of an engineering or scientific project. It has been a
distinct privilege for me to work with them. Also I would like to thank Professor Sabine Glesner
for her competent and helpful feedback during her review of my thesis.
There are a number of people in my everyday circle of colleagues who have enriched my profes-
sional life in various ways. I am particularly indebted to my colleagues Jiangjiang Ceng, Oliver
Wahlen and Gunnar Braun, who worked together with me on the Compiler Designer project.
Without their contributions, without their support, and without the inspiring working atmosphere
this work would have been impossible. I am also indebted to Felix Engel for many stimulating
discussions and the excellent cooperation in the SIMD project. Life would be bleak without
all the nice and funny moments I had with my co-students during all these years. I thank all of you.
I was fortunate to have enthusiastic support from students who worked with me towards their
theses. Without their contributions I could never have realized this work. I sincerely offer my
gratitude to Gerrit Bette, Felix Engel, Andrey Gavrylenko and Christoph Schumacher.
I am also grateful to Hanno Scharwächter, Torsten Kempf and Stefan Kraemer, who were patient
and brave enough to carefully proofread this thesis. Their constructive feedback and comments
at various stages have been very useful in shaping the thesis up to completion.
Last but not least, I would like to thank the people I care about most in the world, my family
and Wibke. I would like to thank Wibke for the many sacrifices she has made to support me in
undertaking my doctoral studies. By providing her steadfast support in hard times, she has once
again shown the true affection and dedication she has always had towards me. Finally, my biggest
thanks go to my parents Hans and Inge, without whom I would not be sitting in front of my computer
typing these lines of acknowledgement. I owe my parents much of what I have become. I dedicate
this work to them, to honor their love, patience, and support throughout my entire studies.
Contents
1 Introduction
1.1 Motivation
1.2 Outline of the Thesis
2 ASIP Design Methodology
2.1 ASIP Design Phases
2.2 Compiler-in-the-loop Architecture Exploration
2.3 Design Methodologies
2.4 Synopsis
3 A Short Introduction to Compilers
3.1 General Overview
3.2 Compiler Frontend
3.3 Compiler Backend
3.3.1 Data- and Control Flow Graph
3.3.2 Code Selection
3.3.3 Register Allocation
3.3.4 Instruction Scheduling
3.3.5 Code Emitter
3.4 Retargetable Compilers
3.5 Synopsis
4 Related Work
4.1 Instruction-set centric ADLs
4.2 Architecture centric ADLs
4.3 Mixed-level ADLs
4.4 Other related approaches
4.5 Synopsis
5 Processor Designer
5.1 The LISA Language
5.2 Compiler Designer
5.3 Synopsis
6 Code Selector Description Generation
6.1 The Semantic Gap
6.2 SEMANTICS Section
6.2.1 Micro-operations
6.2.2 Modeling Complex Operations
6.2.3 Semantics Hierarchy
6.3 Code Selector Description Generation
6.3.1 Nonterminal Generation
6.3.2 Mapping Rule Generation
6.4 Compiler Designer Integration
6.5 Synopsis
7 Results for SEMANTICS based Compiler Generation
7.1 Case Studies
7.2 Mapping Rule Generation
7.3 Compiler Evaluation
7.3.1 PP32
7.3.2 ST220
7.3.3 MIPS
7.4 Conclusions
8 SIMD Optimization
8.1 Related Work
8.2 SIMD Framework
8.2.1 Basic Design Decisions
8.2.2 Terminology
8.2.3 Alignment Analysis
8.2.4 SIMD Analysis
8.2.5 Strip Mining and Loop Peeling
8.2.6 Scalar Expansion
8.2.7 The Vectorizer
8.2.8 Loop Unrolling
8.2.9 The Unroll-and-Pack based SIMDfyer
8.2.10 Code Example
8.3 Retargeting the SIMD Framework
8.3.1 SIMD Candidate Matcher
8.3.2 SIMD-Set Constructor
8.4 Experimental Results
8.5 Conclusions
9 Predicated Execution
9.1 Code Example
9.2 Related Work
9.3 Optimization Algorithm
9.3.1 Implementation Schemes
9.3.2 Probability Information
9.3.3 Cost Computation
9.3.4 Selecting the best Scheme
9.3.5 Splitting Mechanism
9.4 Retargeting Formalism
9.5 Code Generation Flow
9.6 Experimental Results
9.7 Conclusions
10 Assembler Optimizer
10.1 Related Work
10.2 Application Programmer Interface
10.3 Scheduler
10.4 Peephole Optimizer
10.4.1 Replacement Library
10.5 Experimental Results
10.6 Conclusions
11 Summary and Outlook
A Semantics Section
A.1 Semantics Statements
A.1.1 Mode Statements
A.1.2 Assignment Statement
A.1.3 IF-ELSE Statements
A.1.4 Non-Assignment Statements
A.2 Micro-operators
A.2.1 Notations
A.2.2 Group of arithmetic operators
A.2.3 Group of logic operators
A.2.4 Group of shifting operators
A.2.5 Group of zero/sign extension operators
A.2.6 Others/Intrinsic operators
A.2.7 Affected flag declarations
A.2.8 General bit specifications
A.3 SEMANTICS Section Grammar
A.3.1 Grammar Notation
A.3.2 SEMANTICS Grammar
B CoSy Compiler Library Grammar
B.1 Grammar Notation
B.2 Global Structure
B.3 Basic Rules
B.3.1 CoSy IR
B.3.2 Rule Condition
B.3.3 CoSy Condition
B.3.4 Nonterminal Constraint
B.3.5 Control Clause
B.3.6 Read/Write Clause
B.3.7 Scratch Registers
B.3.8 Semantics Pattern
B.3.9 Node Assignment
B.3.10 Result Clause
B.4 Semantics Transformations
B.5 Compiler Semantics
B.5.1 Assignment Statement
B.5.2 Label Statement
B.5.3 IF-ELSE Statement
B.5.4 Non-Assignment Statement
B.5.5 Micro-Operation
B.5.6 Operands
B.6 Miscellaneous
Bibliography
Chapter 1
Introduction
1.1 Motivation
Digital information technology has revolutionized the world during the last few decades. Today
about 98 percent of programmable digital devices are actually embedded [117]. These embedded
systems have become the main application area of information technology hardware and are the
basis to deliver the sophisticated functionality of today’s technical devices. As shown in Figure
1.1(a), current forecasts predict a worldwide embedded system market of $88 billion in 2009.
Figure 1.1: (a) Global embedded systems revenue (billions of $, 2004 and 2009) and average annual
growth rate (AAGR) for America, Europe, Japan, and Asia-Pacific [91]; (b) Crisis of complexity:
available gates vs. used gates (millions of gates), 1993 to 2005, with the design productivity gap [193]
Over the past few years, the ever increasing complexity and performance requirements of new
wireless communications, automotive and consumer electronics applications have been changing the
way embedded systems are designed and implemented. In conformity with Moore’s Law [88], one
driving force is the rapid progress in deep submicron process technologies. Chip designers and
manufacturers have constantly pushed the envelope of technological and physical constraints. In
fact, designers have more gates at their disposal than ever before. However, current mainstream
embedded system designs leave at least 50 percent of the silicon area available to them unused
(Figure 1.1(b)). The growth in design complexity threatens to outpace the designer’s productivity,
on account of unmanageable design sizes and the need for more design iterations due to
deep-submicron effects. This phenomenon is also referred to as the crisis of complexity [91] and
comes along with exponentially growing Non-Recurring Engineering (NRE) costs (Figure 1.2) for
designing and manufacturing chips. Understandably, these costs only amortize for very large
volumes or high-end products.
Figure 1.2: Projected embedded system design cost model [109]
Consequently, more and more Application Specific Integrated Circuits (ASICs) are replaced by
programmable processors. Such processor platforms extend the product life-cycle and achieve
greater design reuse via software, thereby reducing development times and NRE costs. Moreover,
the flexibility of software can be used to create design derivatives, to make functional
corrections due to process defects, and to provide performance improvements via updates.
Meanwhile, the high degree of integration offered by today’s semiconductor technology permits
increasingly complex systems to be realized in a single programmable System-on-Chip (SoC).
Current SoC designs employ several programmable processor cores, memories, ASICs, and other
peripherals as building blocks. It is conjectured that by the end of the decade SoCs will feature
hundreds of heterogeneous processor cores connected by a network-on-chip.
In order to efficiently explore the huge design space, tools and methodologies are needed that
offer the next level of productivity required for successful SoC design. This has led to
significant research activities in the field of Electronic System Level (ESL) design. ESL design
automation tools provide the ability to quickly assemble, simulate and analyze alternative
architectures. The ultimate goal is to find the optimal combination of components for the given
application domain within a short time-to-market window. One piece in this puzzle is to strike
the right balance between flexibility and performance for each system component.
On one side of the flexibility vs. performance spectrum are General Purpose Processors (GPPs).
They offer high programmability and low design time, but may not satisfy area and performance
challenges. On the other side of the spectrum are ASICs. They can be easily optimized for
the given application but, naturally, provide almost no flexibility and suffer from a lengthy and
costly design process. Therefore, an increasing number of embedded SoC designs employ Application
Specific Instruction-set Processors (ASIPs) [26, 145, 116] as efficient implementation vehicles.
They provide the best of both worlds, i.e. high flexibility through software programmability and
high performance through specialization. However, finding the optimal balance between flexibility,
performance, and energy efficiency constraints requires a thorough architecture exploration. This
process demands software development tools in order to efficiently map application programs to
varying ASIP configurations. In particular, the availability of a compiler translating high-level
programming languages to assembly code has become indispensable. Embedded processors have
traditionally been programmed in assembly languages for efficiency reasons. Considering the
steadily growing software content of SoCs (Figure 1.3(b)), this is a time-consuming and
error-prone process which is no longer feasible given today’s tight time-to-market constraints.
Furthermore, compiler-in-the-loop design space exploration helps to understand the mutual
dependencies between processor architectures, the respective instruction set, compilers and the
resulting code [171]. Otherwise the result might be a strongly compiler-unfriendly architecture,
leading to an inefficient application design in the end.
Nowadays retargetable compilers are widely used for architecture exploration since they can be
quickly adapted to varying processor configurations. Unfortunately, such compilers are often
hampered by their limited code quality as compared to hand-written compilers or assembly code due
to the lack of dedicated optimization techniques. Narrowing this code quality gap requires
generalized optimization techniques for those architectural features which recur frequently in
ASIP design. Such techniques achieve retargetability and high code quality for a whole target
processor class.
Figure 1.3: (a) Compiler-in-the-loop architecture exploration, with the elements application (.c),
compiler, assembler, linker, simulator, and profiler; (b) Software complexity: million lines of
code (MLOC), 1998 to 2008, growing by approx. 44% per year [76]
A complete compiler-in-the-loop architecture exploration as shown in Figure 1.3(a) also demands
an assembler, linker, simulator and profiler which, naturally, have to be retargetable as well.
This led to the development of Architecture Description Languages (ADLs), which enable the
automatic generation of the complete software toolkit (or at least components thereof) from a
single processor model. The high degree of automation reduces the design effort significantly and
hence allows the designer to explore a larger number of architectural alternatives. The most
challenging task in designing an ADL, though, is to capture the architectural information needed
for the tool generation in an unambiguous and consistent way. This is particularly difficult for
the compiler and the simulator, as both essentially need information about the instruction’s
semantics, but from different points of view. The compiler, more specifically the compiler’s code
selector, needs to know what an instruction does in order to select appropriate instructions for a
given piece of source code, while the simulator needs to know how the instruction is executed. In
practice it is quite difficult, if not impossible, to derive one kind of information from the
other. None of the existing ADLs, if compiler generation is supported at all, solves this problem
in a sophisticated manner. Either redundancies are introduced or the language’s flexibility is
sacrificed. Moreover, the specification of compiler relevant information mostly requires in-depth
compiler knowledge. This particularly applies to the code selector specification, the largest part
of the compiler description. So far, there is almost no support for generating code selector
descriptions automatically.
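The difference between what an instruction does and how it is executed can be illustrated with a
minimal Python sketch. It is not LISA syntax and does not reproduce the technique developed in
this thesis; the 32-bit ADD and MUL instructions, the 4-byte instruction width, and the names
Machine, simulate_add, RULES and select are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical machine state used by the simulator view."""
    regs: list
    pc: int = 0


def simulate_add(m, rd, rs1, rs2):
    """'How' view: step-by-step behavior of an assumed 32-bit ADD."""
    a = m.regs[rs1]                    # operand fetch
    b = m.regs[rs2]
    m.regs[rd] = (a + b) & 0xFFFFFFFF  # execute with 32-bit wrap-around, write back
    m.pc += 4                          # advance to the next instruction (assumed width)


# 'What' view: each rule states only the effect of an instruction as a tree
# pattern (operator, operand kinds, result kind) plus an assembly template.
RULES = [
    ("add", ("reg", "reg"), "reg", "ADD {dst}, {0}, {1}"),
    ("mul", ("reg", "reg"), "reg", "MUL {dst}, {0}, {1}"),
]


def select(tree, code):
    """Naive tree-covering code selector; returns the name holding the result."""
    if isinstance(tree, str):          # leaf: value assumed to be in a register already
        return tree
    op, left, right = tree
    l = select(left, code)
    r = select(right, code)
    for rule_op, _operands, _result, template in RULES:
        if rule_op == op:
            dst = f"t{len(code)}"      # fresh temporary, numbered by emitted instructions
            code.append(template.format(l, r, dst=dst))
            return dst
    raise ValueError(f"no rule covers operator {op!r}")


if __name__ == "__main__":
    code = []
    select(("add", "a", ("mul", "b", "c")), code)
    print("\n".join(code))
```

The selector never consults simulate_add, and the simulator never consults RULES. In this sketch
both descriptions of ADD are written by hand, which is exactly the kind of redundancy that can
lead to inconsistencies between compiler and simulator.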
This thesis presents a solution to the aforementioned retargetable compilation problems. A novel
technique is developed for extracting the code selector description fully automatically from an
ADL processor model. The approach is based on the LISA ADL [13] using a language extension
for instruction semantics description. This enables the automatic generation of both the C
compiler and the simulator from a single processor description without losing flexibility or
introducing inconsistencies. In this way, a high speedup in compiler generation is achieved, which
contributes to a more efficient ASIP design flow. The feasibility of the approach is demonstrated
for several contemporary embedded processors.
In order to improve the code quality of the generated compilers, retargetable optimizations for two
common ASIP features, namely Single Instruction Multiple Data (SIMD) support and Predicated
Execution, are presented. Several representative RISC cores and VLIW architectures are used as
driver architectures to obtain code quality results. In this way, the code quality of the generated
compilers for architectures equipped with at least one of these features can be significantly im-
proved. Furthermore, a new retargetable assembler is implemented supporting an interface for the
implementation of code optimizations. This allows the user to quickly create custom low-level
optimizations. A scheduler and peephole optimizer are built as demonstrators.
As a result, this thesis presents an integrated solution to enable a complete and retargetable path
from a single LISA processor model to a highly optimizing C compiler and assembler. This com-
pletes LISA’s already established capabilities such that efficient compiler-in-the-loop architecture
exploration becomes broadly feasible.
1.2 Outline of the Thesis
This thesis is organized as follows. Chapter 2 provides a background covering the necessity of
architecture description formalisms and compiler-in-the-loop architecture exploration. Afterwards,
Chapter 3 gives a short introduction to compiler construction where the most important concepts
required for the scope of this thesis are summarized. Chapter 4 describes the related work in the
field of compiler aided processor design. The advantages and drawbacks of various approaches are
also clearly mentioned. Surveys of relevant publications specifically related to individual chapters
of this thesis are given at the beginning of the corresponding chapters. The work presented in this
thesis is integrated into the industry proven Processor Designer ASIP design platform. The related
Language for Instruction-Set Architectures (LISA) ADL and the current C compiler generation
flow are elaborated in Chapter 5, whereas Chapter 6 presents a novel technique to generate
the code selector description fully automatically from a LISA processor description. Chapter
7 provides an analysis of the code quality produced by the generated compilers. Afterwards,
Chapters 8 and 9 present two high-level retargetable code optimizations, namely
optimizations for the class of processors with SIMD support and with Predicated Execution, respectively.
Chapter 10 concentrates on a retargetable assembler for the quick implementation of user-defined
assembly-level optimizations. Chapter 11 finally summarizes the major results of this work and
gives an outlook to future research. Appendix A contains an overview of the developed LISA
language extensions and Appendix B provides the formal description of the database as used for
code selector generation.
Chapter 2
ASIP Design Methodology
The design of an ASIP is a challenging task due to the large number of design options. The
competing design goals such as flexibility, performance and energy consumption need to be
weighed against each other to reach the optimal point in the entire design space. Moreover,
the increasing software complexity of today’s SoCs requires a shift from traditional assembly
programming to high-level languages to boost the designer’s productivity. As a result, processor
designers demand increasing support from the design automation tools to explore the design
space and to properly balance the flexibility vs. performance trade-off.
Section 2.1 firstly presents the four major phases in ASIP design. Afterwards, Section 2.2 elabo-
rates on the benefits and issues of compiler-in-the-loop architecture exploration. Finally, Section
2.3 presents prominent ASIP design methodologies. A survey of different ASIP design environ-
ments is given in [152].
2.1 ASIP Design Phases
The design of an ASIP is a highly complex task requiring diverse skills in different areas. The
design process can be separated into four interrelated phases (Figure 2.1):
Architecture exploration: The target application is mapped onto a processor architecture in
an iterative process that is repeated until a best fit between architecture and application is
obtained. According to Amdahl’s law [79] the application’s hot spots need to be optimized
to achieve high performance improvements and hence, constitute promising candidates for
dedicated hardware support. In order to identify those hot spots, profiling tools such as
[214, 130] are employed. Based on this hardware/software partitioning the Instruction-Set
Architecture (ISA) is defined in a second step. Afterwards, the micro-architecture needs to
be designed that implements the ISA. The whole process requires an architecture specific
set of software development tools (compiler, assembler, linker, simulator and profiler). Unfortunately,
every change to the architecture specification requires a completely new set of
software development tools.
Figure 2.1: ASIP design phases (Architecture Exploration, Architecture Implementation, Software Application Design, System Integration)
Architecture implementation: The specified processor is converted into a synthesizable Hard-
ware Description Language (HDL) model. For this purpose, languages such as VHDL [107]
or Verilog [106] are employed. This model can then be further used for a standard synthesis
flow (e.g. Design Compiler [227]). With this additional transformation, quite naturally, con-
siderable consistency problems can arise between the architecture specification, the software
development tools, and the hardware implementation.
Software application design: Software designers need a set of production-quality software de-
velopment tools for efficient application design. However, the demands of the software appli-
cation designer and the hardware processor designer place different requirements on software
development tools. For example, the processor designer needs a cycle/phase-accurate simu-
lator for hardware-software partitioning and profiling which is very accurate, but inevitably
slow. The application designer in contrast demands more simulation speed than accuracy.
At this point, the complete set of software development tools is usually re-implemented by
hand which leads to consistency problems.
System integration and verification: The designed ASIP must be integrated into a system
simulation environment of the entire SoC for verification. Since the interaction of all SoC
components may have an impact on the processor performance this provides more accurate
results as compared to an instruction-set simulator. However, in order to integrate the
software simulator, co-simulation interfaces must be developed. Again, manual modifications
of the interfaces are required with each change of the architecture.
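The Amdahl's law argument used in the architecture exploration phase above can be made concrete with a small worked equation. The symbols are introduced here purely for illustration and do not appear in the original text: p is the fraction of execution time spent in a hot spot, s is the factor by which dedicated hardware support accelerates that hot spot, and S is the resulting overall speedup.

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad
\lim_{s \to \infty} S = \frac{1}{1 - p}
% Example: p = 0.8, s = 10  =>  S = 1 / (0.2 + 0.08) \approx 3.57
% Upper bound for p = 0.8:      S < 1 / 0.2 = 5
```

Under these assumptions, a hot spot that accounts for 80% of the runtime limits the achievable speedup to a factor of 5, no matter how fast the dedicated hardware is. This is why profiling is needed to find the hot spots with the largest p first.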
In traditional ASIP design, these phases are processed sequentially and are assigned to different
design groups each with expert knowledge in the respective field. Design automation – if available
at all – is mostly limited to the individual phases. Moreover, results in one phase may impose
modifications in other phases. As a result, the complexity of design team interactions and communications
necessary to successfully undertake a SoC-based design is a significant time-consuming
factor. What makes this even more challenging is the large number of design alternatives which
need to be weighed against each other. Consequently, the designer’s productivity becomes the
vital factor for successful products due to the complexity and tight time-to-market constraints. As
a result, there is a strong interest in comprehensive design methodologies for efficient embedded
processor optimization and exploration.
2.2 Compiler-in-the-loop Architecture Exploration
Much of the functionality in a SoC is implemented in software due to a number of reasons:
The flexibility of software offers wide design reuse and compatibility across applications. It
is conjectured that the amount of software in embedded systems roughly doubles every two
years [76]. As a result, a rapidly increasing amount of software has to be validated and/or
developed. This involves not only essential hardware drivers but also complete operating systems.
Furthermore, new applications, exploiting the new hardware capabilities, need to be developed
before the end products based on the SoC can be sold.
Embedded processors, however, have been traditionally programmed in assembly language
for efficiency reasons. Considering the increasing complexity of applications, assembly
programming is no longer feasible given today’s short time-to-market windows. Obviously,
such requirements can be much better met by using High-Level Language (HLL) compilers.
In the context of embedded systems the C programming language [37] is widely used. It is a
well-tried programming language which also allows a very low-level programming style if needed.
Additionally, this enables broad design reuse since a large number of industry standards
and a large amount of legacy code already exist in C. Unfortunately, designing a compiler is a complex task
which demands expert knowledge and a large amount of human resources. As a result, compilers
are often not available for newly designed processors. Clearly, this increases the probability
of designing a strongly compiler-unfriendly architecture which leads to an inefficient application
implementation in the end. In fact, many in-house ASIP design projects suffer from the late
development of the compiler. Compiler designers often have severe difficulties ensuring good
code quality due to instruction-sets that have primarily been designed from a hardware designer’s
perspective. On the other hand, a compiler-friendly instruction-set and architecture might not be
entirely suitable to support the hardware designer’s effort to meet constraints such as area and
power consumption. Therefore, compiler-in-the-loop architecture exploration is crucial to avoid a
compiler and architecture mismatch right from the beginning and to ensure efficient application
design for successful products.
The inherently application-specific nature of embedded processors leads to a wide variety of
embedded processor architectures. Understandably, developing the software tools, in particular
the compiler, for each processor is costly and extremely time-consuming. Therefore, retargetable
C compilers have found significant use in ASIP design in the past years since they can be quickly
adapted to varying processor configurations. This is also a result of the increasing tool support
for automatically retargeting a C compiler based on formalized processor descriptions [200].
In compiler-in-the-loop architecture exploration the compiler plays a key role to obtain exploration
results. Because a C application can be translated to assembly code in many different ways, it is
possible to quickly evaluate fundamental architectural changes with minimal modifications of
the compiler [171]. In this way, designers can meaningfully and rapidly explore the design
space by accurately tracking the impact of changes to the instruction-set, instruction latencies,
register file size, etc. This is an important piece in the puzzle to better understand the mutual
dependencies between micro-architecture design, the respective instruction-set, compilers and
the achieved code quality. What is most important in this context is the specification of the
compiler’s code selector. It basically describes the mapping of the source code to an equivalent
sequence of assembly instructions and hence, significantly affects the final ISA definition (i.e. the
software/hardware partitioning). However, the success of compiler aided architecture exploration
strongly depends on a flexible C compiler backend that is generated from the processor description.
Even though retargetable compilers have found significant use in ASIP design in the past years,
they are still hampered by their limited code quality as compared to hand-written compilers
or assembly code. This is actually no surprise, since higher compiler flexibility comes at the
expense of a lower amount of target-specific code optimizations. Since such compilers can
only make few assumptions about the target machine, it is, understandably, much easier to
support machine-independent optimizations rather than techniques exploiting novel architectural
features of emerging embedded processors. However, the lower code quality of the compilers is
usually acceptable considering that the C compiler is available early in the processor architecture
exploration loop. Thus, once the ASIP architecture exploration phase has converged and an
initial working compiler is available, it must be manually refined to a highly optimizing compiler
or the application’s hot spots must be manually replaced by assembly programs; both are
time-consuming tasks. One way to reduce the design effort is to provide retargetable optimizations for
those architectural features that characterize a processor class. In this way, retargetability and
high code quality for this particular class of processors are achieved. For instance, retargetable
software pipelining support is less useful for scalar architectures; however, it is a necessity for
the class of VLIW processors, and for this class it can be designed in a retargetable fashion.
Retargetable optimization techniques for two common ASIP features are proposed in this thesis
to further improve the code quality of retargetable compilers.
A retargetable assembler, linker, simulator and profiler complete the required software development
infrastructure. Needless to say, keeping all tools manually consistent during architecture
exploration is a tedious and error-prone task. Additionally, they must also be adapted to modifications
performed in the other design phases. As a result, different automated design methodologies
for efficient embedded processor design have evolved. Two contemporary approaches are presented
in the next section.
2.3 Design Methodologies
One solution to increase the design efficiency is to significantly restrict the design space of the
processor. More specifically, such design environments are limited to a predefined processor template
whose software tools and architecture can be configured to a certain extent (Figure 2.2(a)).
Prominent examples for this approach are the Xtensa [191] and the ARCtangent [35] processor
families. Considering that all configuration options are pre-verified and the number of possible
processor configurations is limited, the final processor can be completely verified. However, this
comes at the expense of a significantly reduced design space which imposes certain limitations.
The coarse partitioning of the design space makes it inherently difficult to conceive irregular architectures
suited for several application domains. Furthermore, certain settings of the template
may also turn out to be redundant or sub-optimal, such as the memory interface or the register file
architecture. Another limitation is imposed by the support for custom instructions. Such
instructions must typically be given in an HDL description and hence, cannot be directly utilized
by the compiler.
Another, more flexible concept for ASIP design is based on Architecture Description Languages
(ADLs). Such languages have been established recently as a viable solution for efficient ASIP
design (Figure 2.2(b)). ADLs describe the processor on a higher abstraction level, e.g. instruction
accurate or cycle accurate, to hide implementation details. One of the main contributions of
such languages is the automatic generation of the software toolkit from a single ADL model
of the processor. Advanced ADLs are even capable of generating the system interfaces and a
synthesizable HDL model from the same specification. This eliminates the consistency problem
of the traditional ASIP design flow since changes to the processor model directly lead to a new
and consistent set of software tools and hardware implementation. In this way they provide a
systematic mechanism for a top-down design and validation of complex systems. The high degree
of automation reduces the design effort significantly and thus, allows the designer to explore a
larger number of architectural alternatives. Early ADLs, such as ISPS [139], were used for the
simulation, evaluation and synthesis of computers and other digital systems. Contemporary ADLs
can be classified into three categories [99] based on the kind of information an ADL can capture:
Instruction-set centric: Instruction-set centric languages have been designed with the generation
of a HLL compiler in mind. Consequently, such languages must capture the instruction-set
behavior (i.e. syntax, coding, semantic) of the processor architecture, whereas the
information about the detailed micro-architecture (i.e. pipeline stages, memories, buses, etc.)
does not need to be included. However, it is hardly possible to generate HDL models from
such specifications. Typical representatives for this kind of ADLs are nML [8, 124], ISDL
[86], and CSDL [165].
Figure 2.2: (a) Configurable processor design, (b) ADL based architecture exploration
Architecture centric: This kind of ADL captures the structure in terms of architectural components.
Therefore, such languages are well-suited for processor synthesis. On the other hand,
these languages typically have a low abstraction level leading to a quite detailed architecture
specification. Unfortunately, it is quite difficult, if not impossible, to extract compiler
relevant information (e.g. instruction semantics) from such informal models. Prominent
examples for this category of ADLs are MIMOLA [211], UDL/I [241], and AIDL [230].
Combination of both: These so-called mixed-level description languages [11] describe both
the instruction-set behavior and the structure of the design. This enables the generation
of software tools as well as a synthesizable hardware model. However, capturing both kinds of
information can lead to a huge description which is difficult to maintain. Additionally, such
languages can suffer from inconsistencies due to duplicated information. Certain architectural
aspects need to be described twice, e.g. once for compiler generation and once for
processor synthesis. ADLs belonging to this group are MDes [119], RADL [136], FlexWare
[183], MADL/OSM [252], EXPRESSION [178], and LISA [13].
Obviously, designing an ADL that captures all aspects of ASIP design in an unambiguous and
consistent way is a challenging task. This is further aggravated by the fact that most ADLs
have originally been designed to automate the generation of a particular component and have
then been extended to address the other aspects. As a result, ADLs are often well-suited for the
purpose they have been designed for, but impose major restrictions on, or are even incapable of
the generation of the other components. This is true in particular for the generation of compiler
and simulator. Therefore, another focus of this thesis is on methodologies to generate compiler and
simulator from a single ADL specification without limiting its flexibility or architectural scope. A
detailed discussion of different ADLs is given in Chapter 4.
2.4 Synopsis
• Finding the optimal balance between flexibility and performance requires the evaluation of
different architectural alternatives.
• HLL compilers are needed in the exploration loop to cope with the growing amount of
software and to avoid hardware/software mismatches.
• The widely employed retargetable compilers suffer from their lower code quality as compared
to hand-written compilers or assembly code.
• For quick design space exploration, methodologies using pre-defined processor templates or
ADL descriptions are proposed.
• ADL support for the automatic generation of the complete software tool chain (in particular
compiler and simulator) is currently not satisfactory.
• The primary focus of this thesis is the generation of C compilers from ADL processor models
and retargetable optimization techniques to narrow the code quality gap.
Chapter 3
A Short Introduction to Compilers
This chapter summarizes briefly some basic terms and definitions of compiler construction as well
as the underlying concepts. Only the concepts that are required for the understanding of this
thesis are presented; detailed surveys can be found in [221], [205], and [3].
3.1 General Overview
A compiler is a program that translates a program written in one language (the source language)
into a semantically equivalent representation in another language (the target language). Over the
years new programming languages have emerged, the target architectures continue to change, and
the input programs become ever more ambitious in their scale and complexity. Thus, despite the
long history of compiler design, and its standing as a relatively mature computing technology, it
is still an active research field. However, the basic tasks that any compiler must perform remain
essentially the same. Conceptually, the translation process can be subdivided into several phases
as shown in Figure 3.1. The first is the analysis phase, often called the frontend, which creates an
Intermediate Representation (IR) of the source program. On this representation, many compilers
apply a sequence of high-level, typically machine-independent optimizations to transform the
IR into a form that is better suited for code generation. This includes tasks such as common
subexpression elimination, constant folding, constant propagation etc. A very common set of
high level optimizations is described in [1]. This is also referred to as the midend of the compiler.
Afterwards, the synthesis phase, or the backend respectively, constructs the desired target program.
Frontend and backend are presented in more detail in the following sections.
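As an illustration of one of the machine-independent optimizations named above, the following is a minimal sketch of constant folding on an expression tree. The nested-tuple tree encoding `(operator, left, right)` and the function name `fold_constants` are assumptions made for this example and are not taken from the thesis; the sketch also ignores target-specific integer widths and overflow behavior.

```python
import operator

# Operators that can be evaluated at compile time (illustrative subset).
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(node):
    """Return a copy of the tree in which constant subexpressions are evaluated.

    A node is either a leaf (int constant or str variable name) or a
    tuple (op, left, right). This encoding is hypothetical.
    """
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    # Fold only if both operands are known integer constants.
    if op in FOLDABLE and isinstance(left, int) and isinstance(right, int):
        return FOLDABLE[op](left, right)
    return (op, left, right)


if __name__ == "__main__":
    tree = ("+", ("*", "a", ("+", 2, 3)), ("*", 4, 5))
    print(fold_constants(tree))
```

A production compiler would additionally have to respect the arithmetic semantics of the source language and target machine when evaluating constants.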
Figure 3.1: Common compiler phases. Frontend: lexical, syntax and semantic analysis of the source code; Midend: control and data flow analysis (CFG, DFG) and optimizations on the IR; Backend: code selection, register allocation, instruction scheduling and code emitter producing assembly code.
3.2 Compiler Frontend
The first phase in the frontend is the lexical analysis. A scanner breaks up the program into
constituent pieces, called tokens. Each token denotes a primitive element of the source language,
e.g. a keyword, an identifier, a character etc. Generally, most of these elements can be represented
by regular expressions, which can be recognized by Finite State Machines (FSMs). An FSM consists
of a finite number of states and a function that determines transitions from one state to another
as symbols are read from an input stream (i.e. the source program). The machine transitions
from state to state as it reads the source code. A language element (e.g. a keyword or an integer
number) is accepted if the machine reaches one of a designated set of final states. In this case, a
corresponding token is emitted and the machine returns to the initial state to proceed with the
next character in the stream. Given a list of regular expressions, scanner generators like GNU’s
FLEX [94] can produce C code for the corresponding FSM that can recognize these expressions.
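The FSM-based scanning described above can be sketched as follows. This is a hand-written toy scanner, not the output of a scanner generator; the state names, token classes and keyword set are assumptions chosen for illustration.

```python
# Hypothetical keyword set and token classes, for illustration only.
KEYWORDS = {"if", "else", "while", "return", "int"}


def char_class(c):
    """Map a character to the input symbol class seen by the FSM."""
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"


# Transition function: (state, symbol class) -> next state.
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("start", "digit"): "number",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("number", "digit"): "number",
}
# Final states and the token kind each of them accepts.
FINAL_STATES = {"ident": "IDENT", "number": "NUMBER"}


def tokenize(source):
    """Split source text into (kind, text) tokens using the FSM above."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        state, start = "start", pos
        # Follow transitions as long as one is defined (longest match).
        while pos < len(source) and (state, char_class(source[pos])) in TRANSITIONS:
            state = TRANSITIONS[(state, char_class(source[pos]))]
            pos += 1
        if state in FINAL_STATES:
            text = source[start:pos]
            kind = "KEYWORD" if text in KEYWORDS else FINAL_STATES[state]
            tokens.append((kind, text))
        else:
            # No FSM path from the start state: emit a one-character symbol token.
            tokens.append(("SYMBOL", source[pos]))
            pos += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("x = a * b + c * 5;"))
```

After each emitted token the machine restarts in the `start` state, mirroring the return to the initial state described in the text.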
3.1. Definition (Context free grammar). A context free grammar G is a tuple G =
(T, N, R, S), where T denotes a finite set of terminals (i.e. the set of possible tokens), N a finite
set of nonterminals, and S ∈ N the start symbol. R is a set of rules X → α with α ∈ (T ∪ N)∗,
where the left-hand side X must be a member of N.
The tokens are then further processed by the parser to perform a syntax analysis. Based upon
a context free grammar it identifies the language constructs and maintains a symbol table that
records the identifiers used in the program and their properties. The result is a parse tree that
represents a derivation of the input program from the start symbol S. If the token string con-
tains syntactical errors the parser may produce the corresponding error messages. Again, parser
generators are available (e.g. GNU’s BISON [93]) which can generate a C implementation from a
context free grammar specification.
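To make the syntax analysis step more tangible, here is a minimal sketch of a hand-written recursive-descent parser for a tiny assignment grammar. The grammar (`stmt → IDENT '=' expr ';'`, `expr → term ('+' term)*`, `term → factor ('*' factor)*`, `factor → IDENT | NUMBER`) and all function names are assumptions for this example. The sketch returns an abstract syntax tree rather than a full parse tree and omits the symbol table; generated parsers work differently internally.

```python
import re


def parse_assignment(source):
    """Parse 'IDENT = expr ;' and return a nested-tuple syntax tree."""
    # Simple built-in tokenizer: identifiers, numbers, single-character symbols.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(predicate, what):
        nonlocal pos
        tok = peek()
        if tok is None or not predicate(tok):
            raise SyntaxError(f"expected {what}, found {tok!r}")
        pos += 1
        return tok

    def factor():
        return expect(lambda t: t[0].isalnum() or t[0] == "_", "operand")

    def term():
        node = factor()
        while peek() == "*":
            expect(lambda t: t == "*", "'*'")
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "+":
            expect(lambda t: t == "+", "'+'")
            node = ("+", node, term())
        return node

    target = expect(lambda t: t[0].isalpha() or t[0] == "_", "identifier")
    expect(lambda t: t == "=", "'='")
    tree = ("=", target, expr())
    expect(lambda t: t == ";", "';'")
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return tree


if __name__ == "__main__":
    print(parse_assignment("x = a * b + c * 5;"))
```

Splitting `expr` and `term` into separate functions encodes the usual precedence of multiplication over addition directly in the grammar.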
Finally, a semantic analysis is performed which checks if the input program satisfies the semantic
requirements as defined by the source language, for instance, whether all used identifiers are
consistently declared and used. For practical reasons, semantic analysis can be partially integrated
into the syntax analysis using an attribute grammar [59], an “extended” context free grammar.
Such grammars allow the annotation of a symbol s ∈ (T ∪ N) with an attribute set A(s). An
attribute a ∈ A(s) stores semantic information about a symbol’s type or scope. Each grammar
rule r, with r ∈ R, using a can be assigned an attribute definition D(a). The attributes are
divided into two groups: synthesized attributes and inherited attributes. The former are used
to pass semantic information up the parse tree, while the latter pass it down.
Both kinds are needed to implement a reasonable semantic analysis. Such attribute grammar
specifications can be further processed by tools like OX [126] (an extension of FLEX and BISON)
to finally create a parser with integrated semantic analysis.
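The distinction between synthesized and inherited attributes can be illustrated with a small hand-written tree walk. In this sketch the inherited attribute is the environment of visible declarations, which is passed down to the children, and the synthesized attribute is the type, which is returned upwards. The tree encoding, the type names and the rule that a float must not be assigned to an int are all assumptions for this example and are not taken from a particular language or tool.

```python
def check(node, env):
    """Return the synthesized 'type' attribute of node.

    env is the inherited attribute: a dict mapping each declared
    identifier to its type, handed down from the parent node.
    """
    if isinstance(node, int):
        return "int"
    if isinstance(node, float):
        return "float"
    if isinstance(node, str):  # use of an identifier
        if node not in env:
            raise NameError(f"undeclared identifier {node!r}")
        return env[node]
    op, left, right = node
    if op == "=":
        target_type = check(left, env)
        value_type = check(right, env)
        # Hypothetical rule for this sketch: no implicit float-to-int narrowing.
        if target_type == "int" and value_type == "float":
            raise TypeError("cannot assign float to int without a cast")
        return target_type
    # Arithmetic: the result type is synthesized from the operand types.
    left_type, right_type = check(left, env), check(right, env)
    return "float" if "float" in (left_type, right_type) else "int"


if __name__ == "__main__":
    declarations = {"x": "int", "a": "int", "b": "int", "c": "int"}
    tree = ("=", "x", ("+", ("*", "a", "b"), ("*", "c", 5)))
    print(check(tree, declarations))
```

An attribute grammar tool would derive such evaluation code from attribute definitions attached to the grammar rules instead of requiring it to be written by hand.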
The output IR format of the frontend is typically a list of expression trees or three-address code.
Generally, the frontend is not dependent on the target processor. Thus, an existing language
frontend can be combined with any target specific backend, provided that all of them use the
same IR format.
C code:
    x = a * b + c * 5;

Three address code:
    t1 = a * b;
    t2 = c * 5;
    x = t1 + t2

Expression tree:
      =
     / \
    x   +
       / \
      /   \
     *     *
    / \   / \
   a   b c   5

Figure 3.2: IR format examples
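The relation between the two IR formats of Figure 3.2 can be sketched in a few lines: a post-order walk over the expression tree emits one three-address statement per operator node. The nested-tuple tree encoding and the temporary naming scheme `t1, t2, ...` are assumptions made for this example.

```python
def to_three_address(tree):
    """Translate an assignment tree ('=', target, value) into three-address code."""
    code = []
    counter = 0

    def emit(node):
        """Emit code for a subtree and return the name holding its value."""
        nonlocal counter
        if not isinstance(node, tuple):
            return str(node)  # leaf: variable name or constant
        op, left, right = node
        lhs, rhs = emit(left), emit(right)
        counter += 1
        temp = f"t{counter}"
        code.append(f"{temp} = {lhs} {op} {rhs}")
        return temp

    op, target, value = tree
    if op != "=":
        raise ValueError("expected an assignment at the root")
    if isinstance(value, tuple):
        # The topmost operator writes directly into the assignment target.
        value_op, left, right = value
        lhs, rhs = emit(left), emit(right)
        code.append(f"{target} = {lhs} {value_op} {rhs}")
    else:
        code.append(f"{target} = {value}")
    return code


if __name__ == "__main__":
    tree = ("=", "x", ("+", ("*", "a", "b"), ("*", "c", 5)))
    for line in to_three_address(tree):
        print(line)
```

For the tree of Figure 3.2 this reproduces the three statements shown in the figure.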
3.3 Compiler Backend
The task of the backend is the code generation which consists of several subtasks. Since many of
them are known to be NP-complete [144] problems, i.e. solving such problems most likely requires
algorithms with exponential runtime, code generation typically relies on heuristics. For this reason,
and for software engineering reasons, the code generation tasks are implemented by separate
algorithms. However, these tasks are usually interdependent, i.e. decisions made in one phase
impose constraints in subsequent phases. While this works well for regular architectures, it typically
results in poor code quality for irregular architectures [247]. This is also known as the phase
coupling problem.
Before the different subtasks are presented in the following sections, several program represen-
tations essential for most code generation subtasks (and for most compiler optimizations) are
introduced first.
3.3.1 Data- and Control Flow Graph
The data and control flow graphs provide more detailed information about the program semantics
than the plain IR representation. Firstly, the control flow needs to be computed. Each function
is split into its basic blocks.
3.2. Definition (Basic Block). A basic block $B = (s_1, \ldots, s_n)$ is a sequence of IR statements
of maximum length, for which the following conditions are true: $B$ can only be entered at statement
$s_1$ and left at $s_n$. Statement $s_1$ is called the leader of the basic block. It can either be a function entry
point, a jump destination, or a statement that follows immediately after a jump or a return.
Consequently, if the first statement of a basic block is executed, then all other statements are
executed as well. This allows certain assumptions about the statements in the basic block which
enable the rearrangement of computations during scheduling for instance. Basic blocks can be
easily computed by searching for IR nodes that modify the control flow of the program (e.g. goto
and return statements). Once the basic blocks have been identified the control flow graph can be
constructed.
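The leader-based partitioning into basic blocks can be sketched as follows. The IR encoding is hypothetical: each statement is a tuple whose first element is one of `"label"`, `"goto"`, `"if_goto"`, `"return"` or any other operation name. As a conservative simplification, every label is treated as a potential jump destination.

```python
# Statement kinds that end a basic block (hypothetical IR encoding).
JUMPS = {"goto", "if_goto", "return"}


def split_basic_blocks(stmts):
    """Split a list of IR statements into basic blocks using leaders."""
    if not stmts:
        return []
    leaders = {0}  # the function entry point is always a leader
    for i, stmt in enumerate(stmts):
        if stmt[0] == "label":
            leaders.add(i)  # possible jump destination
        if stmt[0] in JUMPS and i + 1 < len(stmts):
            leaders.add(i + 1)  # statement following a jump or return
    starts = sorted(leaders)
    ends = starts[1:] + [len(stmts)]
    return [stmts[s:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    function_body = [
        ("assign", "i", "0"),
        ("label", "loop"),
        ("if_goto", "i >= n", "done"),
        ("assign", "i", "i + 1"),
        ("goto", "loop"),
        ("label", "done"),
        ("return",),
    ]
    for block in split_basic_blocks(function_body):
        print(block)
```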
3.3. Definition (Control Flow Graph). A Control Flow Graph (CFG) of a function $F$ is
a directed graph $G_F = (V_F, E_F)$. Each node $v \in V_F$ represents a basic block, and $E_F$ contains an
edge $(v, v') \in V_F \times V_F$ if $v'$ might be directly executed after $v$. The set of successors of a basic
block $b \in V_F$ is given by $\mathrm{succ}_b = \{v \in V_F \mid (b, v) \in E_F\}$ and the set of predecessors of $b$
is given by $\mathrm{pred}_b = \{v \in V_F \mid (v, b) \in E_F\}$.
The obvious edges are those resulting from jumps to explicit labels as the last statement $s_n$ of a
basic block. Furthermore, if $s_n$ is a conditional jump or a conditional return then a fallthrough
edge to the successor basic block is additionally created. In certain cases $s_n$ is neither a jump nor a
return. In that case, if a successor block exists whose first statement follows immediately after $s_n$
in the IR representation, an edge to the successor block is created. Blocks without any outgoing
edges have a return statement at the end. In case the resulting CFG contains unconnected basic
blocks, there is unreachable code which can be eliminated by a dead code elimination optimization
without changing the program semantics.
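The edge construction rules above, together with the detection of unreachable blocks, can be sketched as follows. The statement encoding (`("goto", label)`, `("if_goto", condition, label)`, `("return",)`, `("label", name)`) is a hypothetical IR used only for this example; conditional returns are not modeled, and every block is assumed to be non-empty.

```python
def build_cfg(blocks):
    """Return {block index: set of successor block indices}."""
    label_to_block = {}
    for index, block in enumerate(blocks):
        for stmt in block:
            if stmt[0] == "label":
                label_to_block[stmt[1]] = index

    succ = {index: set() for index in range(len(blocks))}
    for index, block in enumerate(blocks):
        last = block[-1]
        kind = last[0]
        has_next = index + 1 < len(blocks)
        if kind == "goto":
            succ[index].add(label_to_block[last[1]])
        elif kind == "if_goto":
            succ[index].add(label_to_block[last[2]])
            if has_next:
                succ[index].add(index + 1)  # fallthrough edge
        elif kind != "return" and has_next:
            succ[index].add(index + 1)  # plain fallthrough
    return succ


def reachable_blocks(succ, entry=0):
    """Return the set of blocks reachable from the entry block."""
    seen, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block not in seen:
            seen.add(block)
            worklist.extend(succ[block])
    return seen


if __name__ == "__main__":
    blocks = [
        [("assign", "i", "0")],
        [("label", "loop"), ("if_goto", "i >= n", "done")],
        [("assign", "i", "i + 1"), ("goto", "loop")],
        [("assign", "j", "0")],  # follows a goto and has no label
        [("label", "done"), ("return",)],
    ]
    cfg = build_cfg(blocks)
    print(cfg)
    print("unreachable:", set(cfg) - reachable_blocks(cfg))
```

Blocks that are never reached from the entry block are candidates for removal by dead code elimination.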
While the CFG stores the control flow on basic block level, another important data structure deals
with the data dependencies between statements.
3.4. Definition (Data Dependency). A statement $s_j$ of a basic block $B = (s_1, \ldots, s_n)$ is data
dependent on statement $s_i$, with $i < j$, if $s_i$ defines a value that is used by $s_j$ (i.e. $s_i$ needs to be
executed before $s_j$).
A Data Flow Analysis (DFA) in its simplest form computes the data dependencies just for single
basic blocks and thus, is referred to as local DFA. Basically, for each statement a data flow equation
is created. Solving the resulting system of equations gives the data flow information for the basic
block. The result is stored in a Data Flow Graph (DFG).
3.5. Definition (Data Flow Graph). A Data Flow Graph (DFG) for a basic block $B$ is a
directed acyclic graph $G_B = (V_B, E_B)$ where each node $v \in V_B$ represents an input operand
(constant, variable), an output (variable) operand or an IR operation. An edge
$e = (v_i, v_j) \in E_B \subset V_B \times V_B$ indicates that the value defined by $v_i$ is used by $v_j$.
A DFG is called Data Flow Tree (DFT) if no node has more than one outgoing edge, i.e. there are
no common subexpressions. Typically, DFTs form the input for many popular code selection
techniques.
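A local data flow analysis for a single basic block can be sketched very compactly if the block is given in three-address form. In this simplified example, each statement is a hypothetical tuple `(dest, op, src1, src2)`; edges connect the defining statement to each using statement, and operands without a definition inside the block are reported as inputs. Edges are kept in a list so that a value used twice is counted twice.

```python
def build_dfg(block):
    """Return (edges, inputs) for a basic block in three-address form.

    edges  -- list of (i, j): statement j uses the value defined by statement i
    inputs -- operands (variables or constants) not defined inside the block
    """
    last_def = {}  # variable name -> index of its most recent definition
    edges = []
    inputs = set()
    for j, (dest, op, *sources) in enumerate(block):
        for src in sources:
            if src in last_def:
                edges.append((last_def[src], j))
            else:
                inputs.add(src)
        last_def[dest] = j
    return edges, inputs


def is_tree(edges):
    """True if no node has more than one outgoing edge (a DFT)."""
    producers = [i for i, _ in edges]
    return len(producers) == len(set(producers))


if __name__ == "__main__":
    block = [("t1", "*", "a", "b"), ("t2", "*", "c", "5"), ("x", "+", "t1", "t2")]
    edges, inputs = build_dfg(block)
    print(edges, sorted(inputs), is_tree(edges))
```

A block in which a temporary feeds two different statements yields a DFG that is not a tree, which corresponds to a common subexpression.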
In practice, compilers perform a DFA for an entire function, called global DFA, since local DFA
hinders many optimization opportunities. Suppose a basic block has several outgoing control
flow edges; then a definition of a variable (e.g. initialized with a constant) may reach multiple uses,
possibly in different basic blocks. Thus, in order to exploit the full potential of e.g. constant
propagation, all uses reached by that definition are required, which can only be provided by a
global DFA. Typically, local DFA is embedded as a subroutine in the global DFA which iteratively
solves the data flow equations for an entire procedure.
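One classic instance of such an iterative global analysis is reaching definitions, sketched below. This is a textbook formulation chosen for illustration, not necessarily the analysis used in the thesis. The input encoding is hypothetical: each block is a list of `(definition id, variable)` pairs, and the CFG is given as a successor map. The per-block gen and kill sets play the role of the local analysis; the fixed-point loop is the global part.

```python
def reaching_definitions(blocks, succ):
    """Iteratively solve the reaching-definitions data flow equations.

    blocks -- {block name: [(def_id, variable), ...]}
    succ   -- {block name: [successor block names]}
    Returns (in_sets, out_sets), each mapping block name -> set of def_ids.
    """
    # All definitions of each variable in the whole function.
    all_defs = {}
    for stmts in blocks.values():
        for def_id, var in stmts:
            all_defs.setdefault(var, set()).add(def_id)

    # Local part: gen and kill sets per block.
    gen, kill = {}, {}
    for name, stmts in blocks.items():
        last = {}
        for def_id, var in stmts:
            last[var] = def_id  # a later definition overrides an earlier one
        gen[name] = set(last.values())
        kill[name] = set().union(*(all_defs[var] for var in last))

    pred = {name: [] for name in blocks}
    for name, targets in succ.items():
        for target in targets:
            pred[target].append(name)

    # Global part: iterate until no out set changes any more.
    in_sets = {name: set() for name in blocks}
    out_sets = {name: set(gen[name]) for name in blocks}
    changed = True
    while changed:
        changed = False
        for name in blocks:
            in_sets[name] = set().union(*(out_sets[p] for p in pred[name]))
            new_out = gen[name] | (in_sets[name] - kill[name])
            if new_out != out_sets[name]:
                out_sets[name] = new_out
                changed = True
    return in_sets, out_sets


if __name__ == "__main__":
    blocks = {"entry": [("d1", "x")], "then": [("d2", "x")],
              "else": [("d3", "y")], "join": []}
    succ = {"entry": ["then", "else"], "then": ["join"],
            "else": ["join"], "join": []}
    in_sets, out_sets = reaching_definitions(blocks, succ)
    print(sorted(in_sets["join"]))
```

In the example, both definitions of `x` reach the join block, so a use of `x` there cannot be replaced by a single constant.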
3.3.2 Code Selection
Code selection is typically the first phase in the backend. Its task is to map the IR to a semantically
equivalent sequence of machine instructions. A common technique for code selection uses DFTs as
input and is based on tree parsing. This can be efficiently implemented by tree pattern matching
combined with dynamic programming [2]. The basic idea is to describe the instruction-set of the
target processor by a context free tree grammar specification.
3.6. Definition (Context free tree grammar). A context free tree grammar G is a tuple
G = (T,N, P, S, w), where T denotes a finite set of terminals, N a finite set of nonterminals, and
P ⊆ N × (N ∪ T )∗ a set of production rules. S ∈ N is the start symbol and w is a cost metric
P → R for the production rules.
In the context of tree pattern matching, T can be seen as the set of all IR nodes and N as some
sort of temporaries or storage location (e.g. registers or memory) to transfer intermediate results
either between or inside instructions. The cost metric describes the costs caused by executing the
corresponding instruction e.g. with regard to performance, code size or power consumption. The
target code is generated by reducing the DFT to a single node (or covering the DFT) by repeatedly
applying one of the production rules P , i.e. a subtree T can be replaced by a nonterminal n ∈ N
if the rule n→ T is in P .
As a typical example of a tree grammar rule, consider the rule for a register-to-register ADD
instruction:
reg → PLUS(reg, reg){costs} = {actions} (3.1)
20 Chapter 3. A Short Introduction to Compilers
with reg ∈ N and PLUS ∈ T . If the DFT contains a subtree whose root is labeled by the operator
“PLUS” and whose left and right sons are labeled with “reg”, this subtree can be
replaced by reg. It should be noted here that both sons might also be the result of further tree
grammar rules which have been applied before. Each rule is associated with a cost and an action
section. The latter typically contains the code to emit the corresponding assembly instruction.
It might happen that more than one rule covers a subtree. A cover is optimal if the sum over all
costs of involved rules is minimal. This can be implemented by a dynamic programming approach,
i.e. the optimum solution is based on the optimum solution of (typically smaller) subproblems.
More specifically, a tree pattern matcher traverses the DFT twice:
In the first bottom-up traversal each node i of a DFT T is labeled with the set of nonterminals it
can be reduced to and, for each such nonterminal n, the cheapest rule r ∈ P producing n and the
total cost (i.e. the cost of covering the subtree rooted at i). This also includes those nonterminals
which might be produced by a sequence of rules. When the root node of T has been reached, the rule
that produces the start nonterminal S with minimum cost is known.
In a second top-down traversal, the pattern matcher exploits the fact that a rule for a node i also
implicitly determines the nonterminals the subtrees of i must be reduced to (otherwise the rule
could not have been applied to i). Thus, starting at the root node, it can now be determined
which nonterminals must be at the next lower level in T . For each of these nonterminals the
corresponding rule r can then be obtained, whose action section finally emits the instructions. This
traversal is recursively repeated until the leaves of T have been reached. Figure 3.3 illustrates this
process using the tree grammar specification in Table 3.1.
Rule Nr.  Nonterminal  Tree pattern         Instruction              Costs
1         stmt →       ASSIGN(ADDR,reg1)    STORE dst = src          1
2         reg1 →       LOAD(ADDR)           LOAD dst = src           1
3         reg1 →       PLUS(reg1,reg2)      ADD dest = src1, src2    1
4         reg1 →       MULT(reg1,reg2)      MUL dest = src1, src2    1
5         reg1 →       MULT(reg1,imm)       MULI dest = src1, src2   1
6         imm →        CONST                -                        0
7         reg2 →       imm                  LOADI dst = src          1
8         reg1 →       reg2                 MOVE21 dst = src         1
9         reg2 →       reg1                 MOVE12 dst = src         1
Table 3.1: Tree grammar specification
Figure 3.3: Tree pattern matching example for the statement x = a ∗ b + c ∗ 5; (each DFT node is
annotated with entries of the form Nonterminal:RuleNr:Cost, and the selected rule is marked)
Tree pattern matching finds an optimal set of instructions for a single DFT in linear time in
the number of DFT nodes. Furthermore, a number of tools are available which can generate tree
pattern matchers from a target specific tree grammar specification. Examples of such so-called
code generator generators are BEG [96], burg [44], iburg [43], lburg (code selector of the lcc
compiler [42]), OLIVE (code selector of the SPAM compiler [224]), and twig [2].
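The two traversals described above can be sketched in a few lines. The following is an illustrative sketch only, not the implementation of any of the tools listed; the rule encoding, the class Node and the functions offer, label and cover are assumptions, and the rules and costs are taken from Table 3.1 as printed. ADDR is treated as a terminal leaf that costs nothing by itself.

```python
# Sketch of tree pattern matching with dynamic programming.
# Rule format (assumed): (rule number, nonterminal, pattern, cost). A pattern is
# either another nonterminal (chain rule) or a tuple (operator, operand, ...).
RULES = [
    (1, "stmt", ("ASSIGN", "ADDR", "reg1"), 1),
    (2, "reg1", ("LOAD", "ADDR"), 1),
    (3, "reg1", ("PLUS", "reg1", "reg2"), 1),
    (4, "reg1", ("MULT", "reg1", "reg2"), 1),
    (5, "reg1", ("MULT", "reg1", "imm"), 1),
    (6, "imm", ("CONST",), 0),
    (7, "reg2", "imm", 1),
    (8, "reg1", "reg2", 1),
    (9, "reg2", "reg1", 1),
]
PATTERN = {number: pattern for number, _, pattern, _ in RULES}


class Node:
    def __init__(self, op, *kids):
        self.op, self.kids = op, kids
        self.best = {}  # nonterminal -> (total cost, cheapest rule number)


def offer(node, nonterminal, cost, rule):
    """Record a rule if it is the cheapest seen so far; report whether it was."""
    if nonterminal not in node.best or cost < node.best[nonterminal][0]:
        node.best[nonterminal] = (cost, rule)
        return True
    return False


def label(node):
    """Bottom-up pass: compute the cheapest rule per nonterminal."""
    for kid in node.kids:
        label(kid)
    if not node.kids:
        node.best[node.op] = (0, None)  # a leaf matches its own terminal
    for number, lhs, pattern, cost in RULES:
        if (isinstance(pattern, tuple) and pattern[0] == node.op
                and len(pattern) - 1 == len(node.kids)
                and all(p in k.best for p, k in zip(pattern[1:], node.kids))):
            total = cost + sum(k.best[p][0] for p, k in zip(pattern[1:], node.kids))
            offer(node, lhs, total, number)
    changed = True
    while changed:  # chain rules (7, 8, 9) until nothing gets cheaper
        changed = False
        for number, lhs, pattern, cost in RULES:
            if isinstance(pattern, str) and pattern in node.best:
                changed |= offer(node, lhs, node.best[pattern][0] + cost, number)


def cover(node, nonterminal, selected):
    """Top-down pass: collect the selected rule numbers in emission order."""
    rule = node.best[nonterminal][1]
    if rule is None:
        return                      # terminal leaf, nothing to emit
    pattern = PATTERN[rule]
    if isinstance(pattern, str):
        cover(node, pattern, selected)          # chain rule, same node
    else:
        for kid, operand in zip(node.kids, pattern[1:]):
            cover(kid, operand, selected)
    selected.append(rule)


def load():
    return Node("LOAD", Node("ADDR"))


# x = a * b + c * 5
tree = Node("ASSIGN", Node("ADDR"),
            Node("PLUS", Node("MULT", load(), load()),
                         Node("MULT", load(), Node("CONST"))))
label(tree)
selected = []
cover(tree, "stmt", selected)
```

After the run, selected holds the rule numbers in the order in which their action sections would emit instructions, and tree.best["stmt"] holds the total cost of the cover together with the rule chosen at the root.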
In case the IR takes the form of a Directed Acyclic Graph (DAG) (due to common subexpressions),
it is usually split into a forest of DFTs based on heuristics. While this works well for regular
architectures, for irregular architectures or architectures with special custom instructions this
may result in sub-optimal code quality. Typically, such architectures comprise instructions that
exceed the scope of a single DFT. Therefore, different approaches to DAG based code selection
have been developed, e.g. [141, 210]. Unfortunately, optimal code selection on DAGs is known
to be NP-complete. Thus, many approaches employ heuristics, impose several restrictions or are
mostly limited to small problem sizes in order to cope with the excessive runtime requirements.
The work in [98] presents a code generator generator, called cburg, for a DAG based code selector.
3.3.3 Register Allocation
The task of the register allocator is to assign variables and temporary values to a limited set of
physical machine registers. Registers are very expensive with regard to area and power consumption.
Therefore, many processor architectures implement only a small register file. Due to the
increasing gap between the processor’s speed and the memory access time, the register allocator
must keep the largest possible number of variables and temporaries in registers to achieve good
code quality. In the following, the most important definitions and concepts of register allocation
are summarized.
3.7. Definition (life range). A virtual register r is live at a program point p if there exists a path
in the control flow graph from p to a use of r on which r is not defined. Otherwise r is
dead at p. The life range of r comprises the program points at which r is live.
3.8. Definition (interference graph). Let V denote a set of virtual registers. An undirected
graph G = (V,E) is called interference graph if for all v, w ∈ V the following condition
holds: (v, w) ∈ E if and only if v and w have intersecting life ranges.
State-of-the-art techniques for register allocation are based on a graph coloring paradigm. The
notion of abstracting storage allocation problems to graph coloring dates from the early 1960s [219].
More specifically, the problem of register allocation is translated into the problem of coloring the
interference graph with K colors, where K denotes the number of available physical registers. The
basic idea of the graph coloring method is based on the following observation: If G contains a
node n with degree d (i.e. the number of edges connected to n) with d < K, a color k from the
set of K colors can be assigned to n that is different from the colors of all its neighbors. The
node n is removed from G and a new graph G′ = G − n is obtained that, consequently, contains
one node and several edges fewer, and the algorithm proceeds with the next node. This approach
leads to a step-by-step reduction of the interference graph. Since graph coloring is NP-complete,
heuristics are employed to search for a K-coloring. If such a coloring cannot be found for the
graph, some values are spilled, i.e. they are kept in memory rather than in registers, which results
in a new interference graph. This step is repeated until a K-colorable interference graph is found.
An example is given in Figure 3.4.
Live in: g, h
1) a = A[g]
2) e = 2 * a
3) f = g + h
4) d = A[f]
5) b = e + d
6) c = d - b
7) f = b + c
Live out: b, c, f
Figure 3.4: Code example, life ranges, interference graph and its coloring (K=3)
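The reduction scheme described above can be tried out on the code of Figure 3.4. The sketch below is illustrative only and is not Chaitin's allocator; the (defined, used) encoding and the names interference_graph and color are assumptions. Each variable is treated as one virtual register, so the two definitions of f share a node. Spilling is not modeled: the function simply gives up if no node of degree below K is left.

```python
# Sketch: interference graph and simplify/select coloring for the code of
# Figure 3.4. The (defined, used) encoding and helper names are assumptions.
CODE = [
    ("a", ["g"]),        # 1) a = A[g]
    ("e", ["a"]),        # 2) e = 2 * a
    ("f", ["g", "h"]),   # 3) f = g + h
    ("d", ["f"]),        # 4) d = A[f]
    ("b", ["e", "d"]),   # 5) b = e + d
    ("c", ["d", "b"]),   # 6) c = d - b
    ("f", ["b", "c"]),   # 7) f = b + c
]


def interference_graph(code, live_in, live_out):
    graph = {}

    def connect(group):
        for v in group:
            graph.setdefault(v, set()).update(group - {v})

    connect(set(live_in))
    live = set(live_out)
    for dest, uses in reversed(code):      # backward liveness scan
        connect(live | {dest})             # dest overlaps all values live after it
        live = (live - {dest}) | set(uses)
    return graph


def color(graph, k):
    """Return {node: color index} or None if the heuristic gets stuck."""
    work = {v: set(ns) for v, ns in graph.items()}
    stack = []
    while work:                            # simplify: remove nodes with degree < k
        node = next((v for v in sorted(work) if len(work[v]) < k), None)
        if node is None:
            return None                    # a spill decision would be needed here
        stack.append(node)
        for neighbor in work.pop(node):
            work[neighbor].discard(node)
    coloring = {}
    for node in reversed(stack):           # select: re-insert and pick a free color
        used = {coloring[n] for n in graph[node] if n in coloring}
        coloring[node] = min(c for c in range(k) if c not in used)
    return coloring


graph = interference_graph(CODE, {"g", "h"}, {"b", "c", "f"})
coloring = color(graph, 3)
```

Calling color(graph, 2) returns None, because every node of this graph has at least two neighbors, so the simplify step finds no candidate with only two colors.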
The first implementation of a graph coloring register allocator was performed by Chaitin et al.
[83, 82]. Later, a priority-based scheme for allocation using graph coloring has been described
[71, 72]. Almost all subsequent work is based on these approaches.
The register allocation algorithms can be further subdivided according to their scope. Local
register allocators, like [83], [71], work only on a single basic block at a time. In contrast, global
register allocation algorithms exceed basic block boundaries and take the control flow structure
of the program into account, e.g. an entire procedure or even a collection of procedures. Since the
latter are able to take execution frequencies of loop bodies, life ranges over basic block boundaries
and calling conventions into account, a better cost analysis can be performed to improve the spill
heuristics. Therefore, many register allocators today are global register allocators. Examples of
graph coloring based global allocators are [72, 176].
Of course, not all global allocation methods are based on graph coloring. Examples of different
approaches include the bin-packing algorithm [175] and the probabilistic register allocation of [231].
Although graph coloring allocators can be implemented efficiently, they have a quadratic runtime
complexity. This makes them impractical whenever compile time is a major concern like in
dynamic compilation environments or Just-In-Time (JIT) compilers. For this domain, an allocator
with linear runtime and acceptable code quality, called linear scan allocator, has been proposed
[155]. The linear scan algorithm consists of the following four steps:
1. Order all instructions linearly.
2. Calculate the set of live intervals.
3. Allocate a register to each interval (or spill the corresponding temporary).
4. Rewrite the code with the calculated allocation.
The linear scan algorithm relies on a linear approximation of the instruction order to determine
simultaneously live intervals. This order influences the extent and accuracy of live intervals and
hence the quality of the register allocation. As investigated in [133], a depth-first ordering is the
optimal one.
After instruction ordering is done, the live intervals are computed. For temporaries outside of
a loop, the interval starts at the first definition of the register and ends at its last use. For
temporaries alive inside a loop, the interval must be extended to the end of the loop. Given live
variable information (e.g. via dataflow analysis [1]), live intervals can be computed easily with
one pass through the ordered instruction list. Intervals interfere if they overlap. The number of
overlapping intervals changes only at the start and end points of an interval. The computed live
intervals are stored in a list that is ordered by increasing start point to make the allocation more
efficient.
As defined in [155], given R available registers and a list of live intervals, the linear scan algorithm
must allocate registers to as many intervals as possible, but such that no two overlapping live
intervals are allocated to the same register. If n > R live intervals overlap at any point, then at
least n−R of them must be spilled. For allocation, the linear scan algorithm maintains a number
of sets:
1. The set of already allocated intervals, called Allocated.
2. The set named Active, which stores the mapping of active intervals to registers.
The algorithm starts with an empty Active set. For each newly processed live interval, the
algorithm scans Active from the beginning to the end and moves those intervals to Allocated
whose end points precede the processed interval’s start point. Removing an interval from Active
makes the corresponding register available for allocation again. The processed interval’s start
point becomes the new start position for the algorithm, and the interval gets a physical register
assigned that is not used by any interval in Active. If all registers are already in use, one interval
must be spilled. The spill heuristic selects the interval with the highest end position.
Instruction ordering:
(1) LiveIn: v1, v2
(2) v3 = v1 - 1
(3) v4 = v2 x v3
(4) v5 = v3 + 8
(5) v2 = v5
(6) v1 = v2 x v4
(7) LiveOut: v1
State at the current position: Active: {v2, v3, v4}; spilled to memory: v1
Final allocation: v1 spilled, v2 R2, v3 R1, v4 R3, v5 R1
Figure 3.5: Linear scan allocation example
Figure 3.5 depicts an example. The live intervals shown in the middle correspond to the instruction
ordering on the left. Suppose the set of allocatable physical registers is R1, R2, R3. In the first
step, the interval v1 is processed and, since the Active list is empty, is assigned the physical register R1.
Consequently, v1 is added to the Active list. When v2 is visited in the next step, v1 is
still live, so another register R2 is assigned to v2 and v2 is added to Active. Afterwards, interval v3 is
processed and is assigned the last free physical register R3. Since no physical register is available
for v4, one interval must be spilled. The algorithm selects v1 for spilling because it has the highest
end position and removes it from the Active list. The example shows the corresponding state of
the intervals and the active list. The final allocation after processing all intervals is depicted on
the right.
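The allocation loop described above can be sketched as follows. This is a simplified illustration and not the implementation from [155] or [9]; the function name, the data layout and the five intervals are hypothetical and are not read off Figure 3.5. The Allocated set of the text is represented implicitly by the location mapping, and a new interval is itself spilled when it ends later than everything in Active.

```python
# Sketch of linear scan allocation over live intervals given as (start, end).
# The intervals used below are hypothetical and not taken from Figure 3.5.

def linear_scan(intervals, registers):
    free = list(registers)
    active = []       # (end, name) of intervals that currently hold a register
    location = {}     # name -> register name or "spilled"
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # expire intervals whose end point precedes the current start point
        for old_end, old in sorted(active):
            if old_end < start:
                active.remove((old_end, old))
                free.append(location[old])
        free.sort()
        if free:
            location[name] = free.pop(0)
            active.append((end, name))
            continue
        last_end, last = max(active)          # interval with the highest end
        if last_end > end:                    # spill it and reuse its register
            location[name] = location[last]
            location[last] = "spilled"
            active.remove((last_end, last))
            active.append((end, name))
        else:                                 # the new interval ends last
            location[name] = "spilled"
    return location


intervals = {"a": (1, 10), "b": (2, 4), "c": (3, 6), "d": (5, 9), "e": (7, 8)}
allocation = linear_scan(intervals, ["R1", "R2"])
```

With two registers, interval a is spilled when c arrives because a has the highest end position; the registers are then reused by d and e once b and c have expired.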
A retargetable linear scan allocator for the CoSy environment [30] was implemented in [9] and
compared to the regular graph based register allocator. The results show an average speedup of
1.6–7.1 for the register allocation while attaining good code quality (average overhead in cycle
count/code size is within 1%–3%).
3.3.4 Instruction Scheduling
Most contemporary processors use pipelining to partially overlap the execution of instructions,
or even Instruction Level Parallelism (ILP) to execute several instructions in parallel, as for
instance Very Long Instruction Word (VLIW) machines do. Generally, scheduling is the process of
reordering instructions in such a way that the maximum amount of parallelism among instructions
is exploited. Similar to register allocation, local schedulers work at the basic block level whereas
global schedulers deal with complete functions.
The scheduling process is limited by two major constraints [190]: firstly, data hazards or control
hazards cause dependencies between instructions that force a sequential ordering, and secondly,
resource limitations, i.e. structural hazards, force serialization of instructions requiring the
same resource. A dependency graph that captures these constraints constitutes the input for most
scheduling techniques.
3.9. Definition (Dependency Graph). A Dependency Graph (DG) is an edge-weighted directed
acyclic graph G = (V,E, type, delay), where each node v in V represents a schedulable
instruction. The resource allocation of each instruction is given by its reservation table r(v). An
edge e = (vi, vj) ∈ E ⊆ V × V indicates a dependency between vi and vj and is weighted with
delay(e), the minimum number of cycles after the start of vi at which the instruction vj can be started.
The dependencies between instructions vi and vj, i < j, can be further categorized into the following
kinds [120]:
Data dependence: vi writes to a resource read by vj. Consequently, vi must be scheduled before
vj. This dependency is also referred to as Read After Write (RAW) dependency and is
the most common type.
Anti-dependence: vi reads a storage location written by vk with k ≠ j that is overwritten by
vj. Thus, in a correct schedule vi reads the value defined by vk before vj overwrites it. This
is also known as Write After Read (WAR) dependence. Since this is often the result of
instructions that write results late in the pipeline while others read the result early in the
pipeline, the associated delay is usually negative.
Output dependence: vi and vj write to the same storage location. A valid schedule must
perform the writes in their original order, i.e. the storage location contains the result of vj
after executing both instructions. This dependency is also denoted as Write After Write
(WAW) dependency.
Control dependence: Determines the ordering of vj with respect to a branch instruction vi so
that vj is executed in correct program order and only if it should be. Thus vj is not executed
until the branch destination is known. Generally, this kind of dependency can also be seen
as a data dependency on the Program Counter (PC) resource.
Note that the Read After Read (RAR) dependency is not considered a data hazard.
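The three storage-related kinds can be told apart from the read and write sets of the two instructions. The following minimal sketch assumes that each instruction is given as a dictionary with "reads" and "writes" sets; this encoding and the function name are illustrative assumptions. Control dependences are not covered.

```python
# Sketch: classify the dependencies between an earlier instruction vi and a
# later instruction vj from their read/write sets (assumed encoding).

def classify(vi, vj):
    kinds = []
    if vi["writes"] & vj["reads"]:
        kinds.append("RAW")   # data dependence: vj reads what vi wrote
    if vi["reads"] & vj["writes"]:
        kinds.append("WAR")   # anti-dependence: vj overwrites what vi read
    if vi["writes"] & vj["writes"]:
        kinds.append("WAW")   # output dependence: both write the same location
    return kinds              # two reads of one location yield no entry (RAR)


add = {"reads": {"r1", "r2"}, "writes": {"r3"}}   # r3 = r1 + r2
mul = {"reads": {"r3", "r4"}, "writes": {"r1"}}   # r1 = r3 * r4
kinds = classify(add, mul)
```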
Since an instruction vi may take several cycles until its result becomes available to vj, it is the
scheduler’s task to fill these so-called delay slots with useful instructions instead of No-Operations
(NOPs). Given a dependency graph, a valid schedule is obtained with a mapping function S that
assigns each node v ∈ V a start cycle number c, c ∈ N such that:
1. S(vi) + delay(vi, vj) ≤ S(vj) for every edge (vi, vj) ∈ E, to guarantee that no dependencies are violated.
2. r(vi) ∩ r(vj) = ∅ for all instructions vi ≠ vj whose resource usage overlaps in time, to avoid structural hazards.
The goal is now to find a schedule Sopt that needs the fewest number of cycles to execute. Let
I denote the set of available machine instructions, then the length L(S) of a schedule S can be
described as follows:
L(S) = max_{v ∈ V} ( S(v) + max_{w ∈ I} delay(v, w) ) (3.2)
The worst-case delay makes sure that the results are definitely available before instructions of
potential successor basic blocks are executed. Unfortunately, computing the optimal schedule Sopt is
an NP-complete problem. Several heuristics are in use for scheduling, of which list scheduling [60] is
the most common approach. This algorithm for local scheduling keeps a ready set that contains all
instructions v whose predecessors in the dependency graph have already been scheduled. The list
scheduler selects an instruction from the ready set and inserts it into the schedule S. Afterwards,
the ready set is updated accordingly and the scheduler proceeds with the next instruction from
the ready set. Different heuristics have been proposed to pick a node from the ready set since this
strongly influences the length of the schedule. For instance, one heuristic picks the instruction on
the current critical path. The length of this path represents the theoretically optimal schedule
length. Figure 3.6 shows an example using this heuristic.
Figure 3.6: List scheduling example; note that two instructions are scheduled in each step
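A list scheduler with the critical path heuristic can be sketched as follows. This is an illustration under simplifying assumptions, not the algorithm of [60]: the machine is reduced to a fixed number of identical issue slots per cycle, an edge (u, v, d) is read as "v may start d cycles after u at the earliest", negative delays are not handled, and the dependency graph used below is hypothetical rather than the one of Figure 3.6.

```python
# Sketch of list scheduling with a critical-path priority for a machine with
# a fixed number of issue slots. The dependency graph below is hypothetical.

def list_schedule(nodes, edges, slots):
    succs = {v: [] for v in nodes}
    preds = {v: [] for v in nodes}
    for u, v, d in edges:
        succs[u].append((v, d))
        preds[v].append((u, d))

    def path(v):
        """Longest delay-weighted path from v to a sink (its priority)."""
        return max((d + path(w) for w, d in succs[v]), default=0)

    start, cycle = {}, 1
    while len(start) < len(nodes):
        # ready set: all predecessors scheduled and their delays elapsed
        ready = [v for v in nodes if v not in start
                 and all(u in start and start[u] + d <= cycle for u, d in preds[v])]
        ready.sort(key=path, reverse=True)     # critical path first
        for v in ready[:slots]:                # fill the issue slots of this cycle
            start[v] = cycle
        cycle += 1
    return start


nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 3, 1), (2, 3, 1), (3, 5, 1), (4, 6, 2), (5, 7, 2), (6, 7, 1)]
start = list_schedule(nodes, edges, slots=2)
```

Cycles in which the ready set is empty or smaller than the number of slots correspond to the NOPs that a code emitter would later insert.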
List scheduling has a worst-case complexity that is quadratic in the number of instructions to
schedule. However, list scheduling is conceptually not effective in handling negative latencies (in
case of anti-dependencies) and filling delay slots. Backtracking schedulers [208] are a solution to
this problem. Such schedulers can revert previous scheduling decisions to schedule the current
instruction earlier if this is likely to be more advantageous.
The amount of parallelism that can be exploited within a single basic block is quite limited since
it contains only a few instructions on average. This is especially a problem for loop bodies, which
typically constitute the hot-spots of a program. One way to increase the number of instructions
in loop bodies is loop unrolling, i.e. duplicating the loop body while reducing the number of
required iterations. Another possibility is a scheduling technique especially for loops, called modulo
scheduling [39]. It is an algorithm for software pipelining loops [154], i.e. the overlapping execution
of several iterations.
An algorithm for global scheduling is trace scheduling [115]. The basic idea is to jointly schedule
instructions of frequently executed and consecutive basic blocks. Such a sequence of basic blocks
is called a trace. In this way the opportunities for ILP exploitation are increased. However, since
the basic block boundaries are neglected, undesired side effects may arise. In order to fix this,
compensation code has to be inserted. Of course, this results in a significant code size increase
which constitutes the major drawback of this approach.
3.3.5 Code Emitter
The code emitter is the final phase of the compiler backend. It is responsible for writing the result
of the previous phases into a syntactically correct assembly program, typically in an output file.
The data structure of the emitter is an emission table. Each row, sorted in increasing order,
represents a clock cycle and each column an instruction. The code emitter first fills the emission
table using the clock cycle information determined by the scheduler. Thus, each row represents the
instructions that are executed together. Afterwards, the table is dumped row by row, where empty
cells are replaced by NOP instructions. While this is straightforward for single issue architectures,
i.e. the table has only one column, constructing instructions for ILP architectures is sometimes
more difficult. Such architectures typically impose constraints on how the instructions can be
combined to build a valid instruction word. Therefore, a packer is incorporated in the emitter
that composes syntactically correct assembly instructions for a given row. The final executable
is then built from the assembly file using an assembler and linker. Both are separate tools which
run after the compiler.
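The emission table can be pictured with a small sketch. The schedule format, the instruction strings and the " || " separator between slots are assumptions made for illustration; a real packer would additionally check the combination constraints of the target.

```python
# Sketch of an emission table: rows are clock cycles, columns are issue slots.
# Schedule format (assumed): instruction text -> (cycle, slot index).

def emit(schedule, slots):
    last_cycle = max(cycle for cycle, _ in schedule.values())
    table = [["NOP"] * slots for _ in range(last_cycle)]   # empty cells are NOPs
    for instruction, (cycle, slot) in schedule.items():
        table[cycle - 1][slot] = instruction
    return [" || ".join(row) for row in table]             # dump row by row


schedule = {
    "ADD r1, r2, r3": (1, 0),
    "LOAD r4, [r5]": (1, 1),
    "MUL r6, r1, r4": (3, 0),
}
lines = emit(schedule, slots=2)
```

For a single issue architecture the same function is called with slots=1, so that every row holds exactly one instruction or a NOP.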
3.4 Retargetable Compilers
Retargetable compilers are capable of generating code for different hardware architectures with
few modifications of their source code. Such compilers take a formal description, e.g. specified in
an ADL, of the target architecture as input and adapt themselves to generate code for the given
target. Retargetability support mostly needs to be provided for the code selector, the scheduler and
the register allocator, i.e. the compiler backend (Figure 3.7).
Different degrees of retargetability exist to achieve this goal. According to the classification in
[195], compilers can be assigned to one of the following classes:
Figure 3.7: Non-retargetable vs. retargetable compiler flow
Parameterizable: Such compilers can only be retargeted to a specific class of processors sharing
the same basic structure. The compiler source code is largely fixed. The machine description
only consists of numerical parameters such as register file sizes, word lengths, the number
of functional units, or different instruction latencies.
User retargetable: An external machine description given in a dedicated language contains the
retargeting information. All information required for code generation is automatically derived
from this description. The specification does not require in-depth compiler knowledge
and hence can be performed by an experienced user.
Developer retargetable: Retargeting is also based on an external target description. However,
the specification requires extensive compiler expertise usually possessed only by very
experienced users or compiler designers.
A further distinction between retargetable compilers depends on the supported processor architectures.
Several compiler environments are limited to a certain processor class, such as:
General Purpose Processors (GPPs): GPPs are characterized by a universal instruction-set
architecture which provides a high degree of flexibility. As a result, they achieve good
performance for a wide variety of applications. Unfortunately, this usually comes at the
expense of a higher power consumption, which makes them largely unsuitable for the
embedded domain. Instead, such processors are widespread in desktop or portable PCs.
Prominent examples for this class are MIPS [160], ARM [33] and the well-known Intel x86
architectures [108].
Very Long Instruction Word Processors (VLIW): This architecture is designed to exploit
ILP, which comes along with very high performance. Several functional units can operate
in parallel, whereas each unit is related to a specific field in the instruction word. Since
such processors do not feature dedicated scheduling hardware like superscalar architectures,
the compiler is responsible for exploiting the ILP which might be present in the given
applications. Representative examples of this processor class include the TriMedia and
Nexperia architectures [169], the Embedded Vector Processor [134] and the ST200 [75].
Digital Signal Processors (DSPs): DSPs have been specifically designed for signal-processing
applications. Consequently, their instruction-set supports dedicated instructions for the
efficient execution of common signal-processing computations, such as the Fast Fourier
Transform (FFT) or digital filtering. Additionally, such processors usually feature hardware
multipliers, Address Generation Units (AGUs) and zero overhead loops. Typical DSP
examples are the TI C5x and C6x [237], the ADSP 2101 [34] and the MagicDSP [62].
Micro-controllers: Micro-controllers operate at clock speeds as low as a few MHz and are very
area efficient. The processor core implements a Complex Instruction-Set Computer (CISC)
architecture. The chip typically integrates additional elements such as Read-Only Memory
(ROM) and Random Access Memory (RAM), Erasable Programmable ROM (EPROM) for
permanent data storage, peripheral devices, and input/output (I/O) interfaces. They are
frequently used in automatically controlled products and devices, such as engine control
systems, remote controls, office machines, appliances etc. Examples for this kind of architecture
are the Motorola 6502 [162] and the Intel 8052 [108].
Application Specific Instruction-set Processors (ASIPs): ASIPs feature highly optimized
instruction-sets and architectures, tailored for dedicated application domains such as
image processing or network traffic management. In this way they achieve a good compromise
between flexibility and efficiency. Examples of this kind are ICORE [228], SODA [258], a
channel decoder architecture for third-generation mobile wireless terminals [69] and an ASIP
for Internet Protocol Security (IPSec) encryption [97].
Some prominent retargetable compilers primarily for GPPs are gcc [78] and lcc [42]. Trimaran
[240] and IMPACT [48] are examples of retargetable compilers for VLIW architectures. Other
examples include CoSy [30], LANCE [198], SPAM [224], and SUIF [226]. Some of them constitute
a key component of the ASIP design environments discussed in Chapter 4. A comprehensive
survey of retargetable compilers can be found in [200].
3.5 Synopsis
• Compilers can be coarsely separated into a frontend and a target specific backend (code
selector, scheduler, register allocator).
• Retargetable compilers can be quickly adapted to varying processor configurations.
• Such compilers are capable of generating the backend components from a formalized processor
description (e.g. an ADL model).
Chapter 4
Related Work
In general, ADL design must trade off the level of abstraction against generality. ADLs must capture
a wide variety of embedded processors with ever changing irregularities. On the one hand, a
lower level description captures structural information in more detail, but on the other hand the
detailed description makes it difficult to extract certain information, for instance instruction
semantics. Obviously, this is easier using higher-level descriptions; however, they make the
generation of e.g. cycle-accurate simulators inherently difficult. Over the past decade, several
ADLs have emerged, each with their own strengths and weaknesses.
In this chapter, the related work in the field of ADL based ASIP design is discussed.
4.1 Instruction-set centric ADLs
nML The nML language [142] was originally proposed by the Technical University of Berlin. It
is one of the first ADLs to introduce a hierarchical scheme to describe instruction-sets. The
topmost elements of the hierarchy represent instructions, and elements lower in the hierarchy
are partial instructions (PI). Two composition rules can be used to group the PIs in their
parents: the AND-rule groups several PIs into a larger PI and the OR-rule enumerates
alternative PIs corresponding to an instruction. For this purpose, the description utilizes an
attribute grammar [121].
Though classified as an instruction-set centric language, nML is not completely free of structural
information. For instance, storage units like registers or memory must be explicitly declared.
Furthermore, it is assumed that each instruction is executed in one machine cycle; there is no
pipeline modeling. The language is used by the instruction-set simulator called SIGH/SIM
[6] and the retargetable code generator CBC [7, 143]. It is also used by the instruction-set
simulator CHECKERS [250] and the code generator CHESS [61] developed at the IMEC
institute [105]. These tools have later been commercialized and are now available from
Target Compiler Technologies [232]. Their tools include support for pipeline modeling and
feature an HDL generator. They have successfully been employed for several DSPs and ASIPs.
Recently, enhanced support for instruction predication has been added to the optimizing C
compiler component of the Chess/Checkers tool-suite.
Another development branch, called Sim-nML [245, 203], has been started by the Indian
Institute of Technology and Cadence Inc. The enhancements include support for pipeline
modeling, branch prediction and hierarchical memories. The generated software tools include
an instruction-set simulator supporting interpretative and compiled simulation, assembler
and a code generator [146]. Additionally, a tool called Sim-HS is available that implements
high level behavioral and structural synthesis of processors from their Sim-nML specifications
[212].
The nML based simulators are known to be rather slow. Target, however, claims to have
faster instruction-accurate simulation techniques which achieve a simulation speed that is
over 100 times faster than conventional cycle-accurate simulators. However, no results have
been published yet.
Since nML models constraints between operations by enumerating all valid combinations,
the resulting description can be quite lengthy. Furthermore, VLIW processors or DSPs with
irregular ILP constraints are hard to model with nML, if at all.
ISDL The acronym stands for Instruction Set Description Language [87]. It was developed at
the Massachusetts Institute of Technology (MIT) to assist hardware-software co-design of
VLIW architectures. Similar to nML, ISDL uses an attribute grammar for the instruction-set
description and storage elements like registers are the only structural information defined
for each architecture. However, in contrast to nML, which captures all valid instruction
compositions, ISDL employs boolean expressions to define invalid combinations. This often
results in a simpler constraint specification and makes it possible to model much more
irregular ILP constraints.
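The difference between the two constraint styles can be illustrated with a small Python sketch. The operation names, the two-slot instruction format and the single-memory-port rule are hypothetical and are taken neither from nML nor from ISDL. The sketch only shows that one boolean rule over invalid combinations can stand in for an enumerated list of valid ones.

```python
from itertools import product

# Hypothetical operations that may occupy each of two issue slots.
OPS = ["add", "sub", "mul", "load", "store", "nop"]
MEMORY_OPS = {"load", "store"}


def is_invalid(slot0, slot1):
    """ISDL-style rule (assumed for illustration): the machine has a single
    memory port, so two memory operations must not be issued together."""
    return slot0 in MEMORY_OPS and slot1 in MEMORY_OPS


# nML-style view: the description has to spell out every valid combination.
valid_combinations = [
    (a, b) for a, b in product(OPS, repeat=2) if not is_invalid(a, b)
]

print(len(OPS) ** 2)            # 36 combinations in total
print(len(valid_combinations))  # 32 combinations an enumeration must list
```

With more slots and more operations the enumerated list grows multiplicatively, whereas the boolean rule stays the same size.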
ISDL is used by the Aviv compiler [215] as well as the related assembler and linker [86].
The Aviv compiler, which is based on the SUIF [226] and SPAM [224] compiler infrastruc-
ture, supports phase-coupled code generation which offers certain advantages over strictly
separated code generation phases. However, since a large number of heuristics need to be
employed to cope with the overall complexity, the optimality is at least questionable. So
far, only results for artificial VLIW processors have been reported. Hence, it is not entirely
clear how Aviv performs for more irregular real-life embedded processors.
Moreover, ISDL is used by the retargetable simulator generation system GENSIM and a
synthesizable HDL code generator [85].
CSDL The Computer System Description Language (CSDL) is actually a family of machine
description languages for the Zephyr compiler environment [4]. It has mainly been developed
at the University of Virginia and consists of the following languages:
• The Specification Language for Encoding and Decoding (SLED) [166] describes in-
struction syntax and binary encoding and is used to retarget assembler, disassembler
and linker. SLED is flexible enough to describe RISC and CISC computers. However,
there is neither a notion of hardware resources nor explicit constraints for instruction
compositions. As a result, SLED is not suitable for describing VLIW architectures.
• For the description of instruction semantics, the Register Transfer list (RT-list) lan-
guage λ-RTL [165] is used. It is based on Standard-ML [202] and was mainly de-
veloped to reduce the description effort for Zephyr’s Very Portable Optimizer (VPO)
[140]. VPO provides instruction selection, instruction scheduling, and classical global
optimization. Unfortunately, VPO needs quite verbose RT-lists for the instruction-set
description as input. Therefore, λ-RTL is translated into RT-lists instead of retargeting
VPO. However, irregular architecture features such as special-purpose registers,
complex custom instructions and ILP constraints are hard to model.
• The Calling Convention Specification Language (CCL) [138] is used to define procedure
calling conventions for uniform procedure call interfaces, i.e. how parameters and return
values are passed between function calls. This information is required by the compiler as
well as the debugger.
A drawback, though, is that all these descriptions must be kept consistent to ensure correctness.
Furthermore, due to the limitations mentioned above, CSDL is better suited for
conventional general-purpose or regular RISC/CISC processors. Embedded processors with
architectural irregularities or VLIW architectures usually cannot be modeled at all.
Results for HDL generation have not been reported yet.
Valen-C [17, 18] is a C language extension to support explicit and exact bit-width specification
for integer data types. The retargetable compiler Valen-CC takes an application written in
Valen-C and a description of the instruction-set as input. It produces code only for RISC ar-
chitectures. The instruction-set description represents only the instruction-set, i.e. pipelines
or resource conflicts are not modeled. A separate description is used for simulator retarget-
ing.
One commonality of all these languages is the hierarchical instruction-set specification using at-
tribute grammars. In this way common properties of instructions can be easily factored out which
simplifies the instruction-set description to a large extent. Instruction semantics for compiler
generation can be easily extracted due to the explicit specification in the form of RT-lists. On the
other hand, such languages do not contain detailed pipeline and timing information. This makes
it inherently difficult to generate cycle-accurate simulators and, to a certain extent, instruction
schedulers. This can only be circumvented by limiting the architectural scope of the language so
that certain assumptions about the target architecture can be made. Moreover, since this kind of
ADL contains no or only limited structural information, the generation of synthesizable
HDL code is either not supported or the quality of the generated HDL code is not satisfactory.
4.2 Architecture centric ADLs
MIMOLA The Machine Independent Microprogramming Language (MIMOLA) [211] is an example
of a Register Transfer level (RT-level) based ADL, developed at the University of
Dortmund. It was originally intended for micro-architecture design. A MIMOLA descrip-
tion mainly consists of two parts, the hardware part with a netlist of component modules,
and the software part describing the applications in a PASCAL-like syntax.
Several tools based on the MIMOLA language have been developed [180], including the
MSST self-test program compiler, the MSSH hardware synthesizer, the MSSB functional
simulator, the MSSU RT-level simulator and the MSSQ code generator. A single MIMOLA
model serves as input for all these tools.
Since pipelined targets cannot be modeled with MIMOLA, the architectural scope is mostly
limited to architectures with single-cycle instructions. Furthermore, the MSSQ compiler sometimes
produces poor code and suffers from high compilation times. The RECORD
compiler [199] constitutes the successor of MSSQ and eliminates some of these limitations.
It generates better code; however, it is restricted to the class of DSP architectures.
Another limitation is the missing C frontend: only the data flow language SILAGE [57] is
supported.
AIDL The AIDL language [230] introduces several levels of abstraction to model a processor. It
has been designed to describe time relations such as concurrency and cause/effect relations
between pipeline stages in a simple and accurate way. The concept of timing relations is
based on interval temporal logic [38]. Each behavior is described using a so-called stage,
which usually corresponds to a pipeline stage. Sequentiality and concurrency are specified
within or between stages. So far, AIDL has only been employed to model three processors, which
are all based on the PA-RISC instruction-set architecture [101].
As described in [229], it is possible to generate synthesizable HDL code and a simulator from
an AIDL specification. So far, support for compiler, assembler and linker generation is not
available.
UDL/I The UDL/I language [241] is also an RT-level hardware description language but, in
contrast to MIMOLA, is mainly intended for compiler generation. It is used as input for the
COACH ASIP design environment [95] which extracts the instruction-set from the UDL/I
description. However, this process imposes some restrictions on the class of supported
architectures. In particular, VLIW architectures are not supported. The generated software
tools include an instruction-set and cycle accurate simulator.
In general, RT-level ADLs are intended primarily for hardware designers. They provide concepts for
a detailed and flexible specification of the micro-architecture. Several approaches have
shown that design automation tools for logic synthesis and test generation as well as
retargetable compilers and simulators can be generated from a single ADL model. However, from a
compiler designer's perspective, all information regarding the instruction-set is buried under an
enormous amount of micro-architectural detail. Thus, extracting the semantics of instructions
automatically is quite hard, if not impossible, without restrictions on description style and supported
target architectures. Furthermore, since merely describing a processor at the RT-level
is already a tedious task, the quick modifications required for efficient architecture exploration
are prohibitively expensive. Moreover, the simulators generated from such ADLs are known to be
rather slow [132].
4.3 Mixed-level ADLs
Maril The Maril language is the description format for the retargetable compiler Marion [53]. A
Maril description contains both an instruction-set description and coarse-grained structural
information. However, it does not employ a hierarchical scheme for instruction-set
specification like instruction-set centric languages. On the other hand, it contains more
structural information than those languages. This enables the generation of resource-based
schedulers which can yield significant performance improvements for deeply pipelined pro-
cessors. Unfortunately, the instruction behavior must be described with a single expression
that can only contain a single assignment. While this is sufficient for compiler generation,
it generally does not provide enough information for accurate simulation. For instance, additional
side effects of instructions (e.g. affecting condition code registers or flags) cannot be
described.
Maril is mainly intended for RISC processors; describing VLIW processors is not possible.
Moreover, it does not contain any information about the instruction encoding. Thus,
retargeting an assembler or disassembler is not possible.
MESCAL/MADL The Mescal Architecture Description Language (MADL) employs an Op-
eration State Machine (OSM) [252] computational model to describe the operations. As
the name implies, it was developed within the Mescal [131] group of the Gigascale Silicon
Research Center (GSRC) [92]. An OSM specification basically separates the processor into
two interacting layers. The operation layer models operation semantics and timing, whereas
the hardware layer describes the micro-architecture. The target scope includes scalar, superscalar,
VLIW and multi-threaded architectures. The approach emphasizes simulator
generation; other software development tools are not generated. Successful case studies are
reported for the StrongARM [64] and the PowerPC-750 [164].
The study claims that instruction schedulers can be retargeted as well but no results in this
regard have been published yet. Meanwhile, OSM has been successfully employed to model
on-chip communication architectures, which allows the generation of cycle-accurate simulators for
multi-processor SoCs [254].
HMDES/MDES The HMDES language [118] constitutes the input for the IMPACT research
compiler [177, 48] developed at the University of Illinois. IMPACT has been designed to
efficiently explore wide-issue architectures which offer many scheduling alternatives for
instructions. Consequently, the definition of the instructions' reservation tables is a central
notion in HMDES. However, information about instruction semantics, assembly syntax and
encoding is missing in HMDES. This is a result of IMPACT not being designed
as a fully retargetable software development tool chain. Basically, IMPACT is an EDG [67]
based optimizing C frontend. Apart from standard optimizations [1], IMPACT supports
some new concepts for ILP exploitation based on extended basic blocks notations [51, 192]
and predicated execution [220].
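The role of reservation tables in scheduling can be illustrated with a short Python sketch. The operations, resources and table contents below are hypothetical and do not follow the HMDES syntax. The sketch only shows the basic check a resource-based scheduler performs, namely whether two operations issued a given number of cycles apart would occupy the same resource in the same cycle.

```python
# Hypothetical reservation tables: for each operation, the set of resources
# it occupies in each cycle after issue (index 0 is the issue cycle).
RESERVATION_TABLES = {
    "add":  [{"alu"}],
    "mul":  [{"mult"}, {"mult"}],      # multiplier is busy for two cycles
    "load": [{"agu"}, {"mem_port"}],
}


def conflicts(op_a, op_b, distance):
    """Return True if op_b, issued `distance` cycles after op_a, needs a
    resource in a cycle in which op_a still occupies it."""
    table_a = RESERVATION_TABLES[op_a]
    table_b = RESERVATION_TABLES[op_b]
    for cycle_b, used_b in enumerate(table_b):
        cycle_a = cycle_b + distance
        if 0 <= cycle_a < len(table_a) and table_a[cycle_a] & used_b:
            return True
    return False


def earliest_issue(op_a, op_b):
    """Smallest non-negative issue distance at which op_b can follow op_a."""
    distance = 0
    while conflicts(op_a, op_b, distance):
        distance += 1
    return distance


print(earliest_issue("mul", "mul"))   # 2: the multiplier is not pipelined here
print(earliest_issue("add", "mul"))   # 0: both can issue in the same cycle
```

A wide-issue machine typically offers several alternative tables per operation (one per functional unit that can execute it); the sketch omits this for brevity.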
The MDES machine description format of the Trimaran compiler infrastructure [240] also
uses an HMDES description as input. Trimaran incorporates IMPACT as well as the Elcor
research compiler [209] from HP Labs. Initially, the compiler could only be retargeted to a
single class of processors, called HPL-PD [244]. Architectural parameters include mainly ILP
related options such as the number of registers, instruction latencies, instruction word length
and the number of available functional units and their scheduling constraints. Meanwhile,
it has also successfully been retargeted to the ARM [32] and the WIMS processor [201].
Trimaran is also employed in the Program In Chip Out (PICO) [243] system for the automatic
design of custom processors. Such processors consist of a configurable VLIW template
(i.e. HPL-PD based) [40] and a non-programmable processor (a one or two dimensional array
of processing elements) [204].
EXPRESSION The EXPRESSION language [178, 12, 181] was developed at the University of California,
Irvine. An EXPRESSION description consists of distinct behavioral and structural
sections. The behavioral section is similar to ISDL, but it is missing assembly syntax
and binary encoding. The specified operations can be bundled to instructions in order to
model VLIW architectures. Additionally, all operations must be manually mapped to generic
compiler operations in order to enable compiler generation. The structural section directly
describes a netlist of pipeline stages and storage units to automatically generate reservation
tables required by the scheduler based on the netlist [179]. However, HDL models cannot
be generated yet.
An EXPRESSION specification is used by the simulator SIMPRESS [20] and the retar-
getable compiler EXPRESS [10]. All tools are integrated into a visual environment, called
V-SAT. So far, the modeled architectures include: ARM7 [32], SPARC [225], TI C6x [237],
DLX [120], Renesas SuperH SH3 [207], and Motorola 56k DSP [163].
IDL The Instruction Description Language (IDL) [187] is used by the FlexWare2 system
[183, 184, 185]. The environment is the successor of FlexWare [186] developed at STMi-
croelectronics. IDL is used in conjunction with the ISA database called Flair which drives
the entire FlexWare system. It consists of the CoSy [30] based FlexCC compiler, assembler,
linker, the simulator FlexSim, the debugger FlexGdb, and the FlexPerf profiler.
The generated code quality is reported to be close to hand-crafted assembly code. How-
ever, the target description contains a large amount of redundancies and hence, requires a
significant verification effort to be kept consistent. Furthermore, FlexWare is intended for
in-house use only.
RADL The Rockwell Architecture Description Language (RADL) [47] is a follow-up of the first
version of the Language for Instruction-set Architecture (LISA) [13]. It focuses on explicit
support of detailed pipeline behavior to enable the generation of cycle- and phase-accurate
simulators [136]; other software tools are not generated. However, so far nothing has been
published about simulators using RADL.
Mixed-level ADLs basically extend instruction-set centric languages by including structural in-
formation. So far, this is mainly used to enable the generation of fast cycle-accurate simulators
and instruction schedulers. The compiler’s code selector mostly either has to be retargeted
manually or is more or less fixed due to a pre-defined processor template. Furthermore,
support for HDL generation is usually not implemented.
4.4 Other related approaches
ASIP Meister The ASIP Meister environment, formerly known as PEAS-III [150, 21], was
jointly developed by the Semiconductor Technology Academic Research Center and
Osaka University. It is an enhanced version of the PEAS system [123, 122], capable of generating
a synthesizable hardware description and the complete software development tool
chain, i.e. a CoSy based C compiler [216], assembler, linker, and simulator. Additionally, it
provides estimates for power consumption, maximum clock frequency and silicon area.
ASIP Meister has no uniform ADL. It is basically a Graphical User Interface (GUI) used
to model the architectures using functional blocks defined in a so-called Flexible Hardware
Model (FHM) library [151]. Each block is associated with behavior, RT-level, and gate-level
information. Unfortunately, the library is not user-extensible, which limits the architectural
scope. Furthermore, for compiler generation the semantics of each block must be manually
specified. So far, successful designs have been reported for the DLX [120] and the
MIPS-R3000 [160], even though the complete instruction-set architecture could not be implemented
in either case.
Based on the ASIP Meister environment, a platform for synthesizable HDL generation of
configurable VLIW processors was developed [257]. However, no information is available
regarding the software tool generation for this architecture class.
UPFAST/ADL The UPFAST [222] system automatically generates a cycle-accurate simulator,
an assembler and a disassembler from a micro-architecture specification written in the Ar-
chitecture Description Language (ADL). So far, it has only been successfully deployed for
several artificial targets based on the MIPS ISA. The speed of the generated simulator is
reported to be two times slower than a hand-crafted version.
PROPAN/TDL The target description language (TDL) is used in the retargetable post-pass
assembly optimization system PROPAN [58], developed at Saarland University. Basically,
an assembler parser as well as a set of C files are generated from a TDL description. The C
files can be included in applications to provide a generic access to architectural information.
However, the architectural scope is mostly limited to VLIW DSPs.
BUILDABONG BUILDABONG [234, 233] is intended to aid the design of special computer
architectures based on architecture and compiler co-generation. The input of this tool is
an Abstract State Machine (ASM) model of the target architecture. It is either derived
from an XASM description or given by a schematic tool entry. BUILDABONG supports the
generation of HDL models, simulator, and compiler. The user must specify the instruction-
set and the code generator generator’s grammar in a GUI called Compiler Composer which
finally generates the compiler executable [55]. The machine model is automatically extracted
from the graphical architecture description and converted to an Extensible Markup Language
(XML) based description, called Machine Markup Language (MAML) [56]. This description
is used by the MAML compiler and constitutes the input for the Compiler Composer. So far,
only artificial architectures have been used as case studies. Future developments will focus
on complex architectures such as the TI C6x family and reconfigurable ASIPs. However,
results regarding the simulation speed, code quality and the exact architectural scope have
not been reported yet.
Liberty The Liberty Simulation Environment (LSE) [158] models processors by connecting hard-
ware modules through their interfaces. These modules are either predefined or parameteri-
zable. From this specification, given in the Liberty Structural Specification (LSS) language
[157], a cycle-accurate simulator is generated. Since Liberty does not provide the facility for
capturing the instruction behavior and binary encoding, it is not suited to create software
development tools.
Babel The Babel [251] language was originally intended for the specification of non-functional
IP blocks. However, the corresponding integration framework retargets a set of GNU tools
[78] (more specifically, the binary utilities) to integrate different IP cores. Obviously, this is
limited to the architectures supported by the GNU tool chain. The employed architectures
include SPARC [225], SimpleScalar [223] and Alpha [194]. Babel is also utilized to retarget
the SimpleScalar simulator [54].
MADE The Modular VLIW processor Architecture and Assembler Description Environment
(MADE) [182] generates a library of behavioral functions and the instruction-set of the
machine from the related architecture description. The library is then linked to a reconfig-
urable scheduling engine which results in a configured optimizer-scheduler. The automatic
configuration of a cycle-accurate simulator is under development. So far, this environment
is only used for the MagicDSP [62].
ARC The ARCtangent processor family [35] from ARC Inc. is a RISC/DSP architecture with a
32-bit four stage pipeline. Each core can be extended by pre-defined modules like floating
point support, advanced memory subsystem with address generation, and special instruc-
tion extensions for common DSP algorithms. The basic ISA implements 86 mixed 16/32 bit
instructions which can be extended to a certain extent by custom instructions. A graphical
user interface (GUI), called ARChitect, allows the designer to select between the given
configuration options and to specify the custom instructions. Additionally, the environment
provides a simulator, a Real Time Operating System (RTOS), and a C/C++ compiler. How-
ever, the instruction-set extensions cannot be directly exploited by the compiler. Instead,
the programmer is forced to use assembly-like function calls (compiler intrinsics) or inline
assembly, which greatly reduces reusability.
Tensilica The Xtensa architecture [191] from Tensilica Inc. [235] offers a large number of configurable
or user-defined extensions that can be plugged into the processor core. The base
architecture has 80 RISC instructions and includes a 32-bit ALU and 32 or 64 general-
purpose 32-bit registers. Among the configurable options are DSP engines, floating point
support, the memory interface, and caches. Custom instructions for application-specific
performance improvements can be specified using the Tensilica Instruction Extension (TIE)
language. The software tools consist of a (GNU based) C-compiler, assembler, linker, simu-
lator, and a synthesizable HDL model. Tensilica reports a 20% performance improvement of
the Xtensa C/C++ Compiler (XCC) as compared to a regular gcc compiler. The compiler
also supports custom instructions and vectorization to a certain extent.
Others A quite recent ADL mainly designed for compiler generation is presented in [213, 70]. The
syntax is based on XML. On the compiler side, an earlier version relies on the Open Compiler
Environment (OCE) from Atair. The current version uses an extended gcc frontend (for
Embedded-C [68]) and a custom backend. Up to now, the language has been used to model
the VLIW DSPs xDSPcore [23] and CHILI [174] as well as the MIPS-R2000 [160] processor.
The description contains enough information to enable the generation of other tools such as
simulator, assembler, linker etc. However, nothing in this regard has been published yet.
Other existing ADLs include ISPS [139], ASIA/ASIA-II [103, 102], ASPD [255], EPICS [206],
READ [256], and PRDML [27].
Further approaches employing parameterizable generic processor cores include JazzDSP [45]
and DSP-core [153].
Several tools choose a different route for implementation. They directly generate a synthe-
sizable HDL model or hardware implementations from the given application. Examples for
this approach are ARTBuilder [31] or the PACT HDL compiler [19]. The major drawback,
of course, is the limited flexibility of the generated hardware.
Quite a large number of ADLs are already available and it is reasonable to expect more
new ADLs, or at least ADL extensions. The effort to develop a new ADL from scratch or
to undertake the tedious task of modifying an existing language has led to a new kind of
ADL. Such ADLs are based on XML. In this way, a standard to encode the elements of
an architecture is provided. This saves development time and makes the model reusable
and interchangeable between tools. Examples for this kind are ADML [238] and xADL [65].
However, ASIP design environments using these languages are not yet known.
Configurable processor cores, ADLs with a limited architectural scope (such as DSP or VLIW) or
ADLs designed for a specific purpose (e.g. simulator or compiler generation) are mostly capable of
generating an efficient set of tools and hardware. The advantage of a limited architectural scope,
in particular in case of configurable cores, is the reduced verification effort, though, at the expense
of a limited design space. In contrast, a broader architectural scope results in increased verification
effort but allows a larger design space. Corresponding ADLs must be suitable for a wide variety
of architectures while at the same time providing design automation for all ASIP design phases.
Such ADLs usually require sophisticated algorithms to generate high quality software tools and
hardware as compared to domain-specific or tool-specific ADLs.
All recent ADLs belong to the mixed-level class. They are well-suited to meet these demands and
they have been successfully employed in academic research as well as in industry. Unfortunately,
these ADLs are either bound to a pre-defined processor template and hence, suffer from limited
flexibility, or do not support the generation of all software development tools and the corresponding
HDL model. While the generation of simulators is mostly supported, compilers, in particular
the code selector description, must still be retargeted manually. This process requires significant
compiler knowledge and delays the availability of a C compiler for early architecture exploration.
Thus, to further lower the entry barrier to compiler generation and to reduce the time-consuming
and tedious manual effort, the automatic generation of code selector descriptions is of strong
interest. This has been the main motivation to implement a methodology for code selector generation
from ADL processor models without sacrificing their flexibility. The approach presented in this
thesis is based on the LISA ADL. The next chapter briefly introduces the corresponding design
environment.
4.5 Synopsis
• Abstract processor modeling is established as an efficient solution for ASIP design.
• Regardless of the ADL implementation, a significant reduction in development time over
the classical ASIP design approach is achieved.
• Due to the difficulty of designing an ADL that supports the generation of the
complete software tool chain (in particular compiler and simulator), current ADLs sacrifice
flexibility, introduce redundancies or support only the generation of particular software tools.
• Certain compiler-relevant information (e.g. scheduler tables) can already be extracted from
ADL descriptions, while other information (e.g. the code selector description) must still be
provided manually.
• An ADL based design environment that supports the automatic generation of all software
development tools while keeping its flexibility is proposed in this thesis.
Chapter 5
Processor Designer
In this thesis, the Language for Instruction Set Architectures (LISA) ADL is used and extended for
automatic generation of C compilers. LISA is the key component of the Processor Designer ASIP
design environment, formerly known as the LISA Processor Design Platform (LPDP) [14, 13]. It
was initially developed at the Institute for Integrated Signal Processing Systems at the RWTH
Aachen University and is now commercialized by CoWare Inc. [50].
[Figure: a single LISA 2.0 description (registers, data and program memory, pipeline control,
program sequencer) in the Processor Designer drives architecture exploration, architecture
implementation, software application design (C compiler, assembler, linker, simulator/debugger,
profiler) and system integration and verification.]
Figure 5.1: LISA Processor Designer
As illustrated in Figure 5.1, based on a single LISA model this platform targets architecture
exploration, architecture implementation, software tools generation and system integration (cf.
Section 2.1). LISA belongs to the group of mixed-level ADLs. Hence, a LISA model captures the
behavior, the structure, and the I/O interfaces of the processor architecture. It has been used to
describe a broad range of architectures including ARM9 [33], TriMedia [169], C54x [236], MIPS32
4K [161], and to develop ASIPs for different application domains [228, 97].
The Processor Designer provides an Integrated Design Environment (IDE) to support the manual
creation and configuration of the LISA model. From the IDE the so-called LISA Processor
Compiler is invoked. It parses the description and generates software development tools like
Instruction-Set Simulator (ISS) [25], debugger, profiler, assembler and linker [16]. Synthesizable
HDL code (VHDL and Verilog) [170] can be generated automatically as well. Moreover, the
LISA ISS can be easily integrated into a co-simulation environment using a set of well-defined
interfaces. In [15] the integration of several LISA models into the SystemC [239] environment
is described. SystemC was used to model the processor’s interconnection, external peripherals,
memories, and buses on a cycle-accurate level. A CoSy based C compiler is manually retargeted
via a GUI [149]. Instruction schedulers, though, can already be automatically generated [172].
The rest of this chapter is arranged as follows. The next section introduces the LISA language
as far as it is relevant for understanding the compiler generation techniques presented in this
thesis. A detailed overview of LISA and the generated software development tools is given in [13].
Afterwards, Section 5.2 describes the current tool flow for C compiler generation.
5.1 The LISA Language
A single LISA model captures all architectural information. It basically consists of two parts:
resource declarations and operations. Resource declarations specify a subset of the processor
resources, namely registers, buses, memories, external pins and internal signals. The resources
can be parameterized w.r.t. signedness, bit-width and dimension.
RESOURCE {
  MEMORY_MAP {
    RANGE(0x00100000, 0x002fffff) -> example_mem[(31..0)];
  }
  RAM unsigned char example_mem {
    SIZE(0x00250000);
    BLOCKSIZE(8,8);
    FLAGS(R|W|X);
  };
  REGISTER unsigned int GPR[0..127];
  PIPELINE pipe = { FE; DE; EX; WB };
  PIPELINE_REGISTER IN pipe {
    unsigned int src1, src2, dst;
  }
} ...
Listing 5.1: Resource declaration
Configuration items for the memories include size, accessible block size, endianness etc. All
resources are global to the LISA model, i.e. they can be accessed within any LISA operation.
Listing 5.1 shows a typical LISA resource declaration. In the example, a 2 MB memory area named
example_mem is specified, which is mapped into the address space starting at 0x100000. Furthermore,
a general purpose register file named GPR with 128 32-bit wide registers and a pipeline named
pipe are declared. The pipeline stages are defined from left to right corresponding to the actual
execution order. PIPELINE_REGISTERS define the storage elements between pipeline stages, here
src1, src2, and dst.
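The numbers quoted for Listing 5.1 can be checked with a few lines of Python. The sketch below only re-computes the size of the mapped address range and the number of registers from the values in the listing. The helper names and the byte-wise offset translation are assumptions made for illustration and are not part of LISA.

```python
# Values copied from Listing 5.1.
RANGE_START = 0x00100000
RANGE_END = 0x002FFFFF          # inclusive upper bound of the mapped range
GPR_FIRST, GPR_LAST = 0, 127    # REGISTER unsigned int GPR[0..127]


def mapped_size_bytes(start, end):
    """Number of byte addresses covered by an inclusive address range."""
    return end - start + 1


def to_memory_offset(address, start=RANGE_START, end=RANGE_END):
    """Translate a processor address into an offset inside example_mem
    (assumed byte-wise mapping); reject addresses outside the range."""
    if not start <= address <= end:
        raise ValueError(f"address {address:#x} is not mapped")
    return address - start


size = mapped_size_bytes(RANGE_START, RANGE_END)
register_count = GPR_LAST - GPR_FIRST + 1

print(size // (1024 * 1024))   # 2, i.e. the mapped range covers 2 MB
print(register_count)          # 128 general purpose registers
```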
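[Figure: operation DAG with the root operation main and the operations fetch, decode, control,
arithmetic, CALL, JUMP, ADD, SUB, ... and writeback, annotated with the pipeline stages FE, DE,
EX and WB, the activations between operations, and the sections GROUP, SYNTAX, CODING, BEHAVIOR
and ACTIVATION.]
Figure 5.2: LISA operation DAG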
The major part of a model consists of operations. An OPERATION is the basic element of the ISA
description. Each instruction is usually distributed over several operations, and each operation
in turn consists of several so-called sections. The CODING section describes the binary coding,
the SYNTAX section the assembly syntax, and the BEHAVIOR section the operation’s behavior.
Operations are organized hierarchically in order to factor out commonalities of instructions, which
reduces the description effort to a large extent. A modeled pipeline implies a cycle-accurate
LISA model and hence, each operation has to be assigned to one of the defined pipeline stages.
Moreover, operations can trigger the execution of a child operation by so-called activations (via a
dedicated ACTIVATION section) or behavioral calls. Additionally, an operation can be activated
or called from several different operations.
The resulting structure is a so-called LISA operation DAG D = (V, E), where V denotes the set of all
LISA operations and E the set of edges due to activations or behavioral calls. The root operation is
the special main operation, which is executed whenever the simulator advances one control step. Among
others, this operation activates the operation fetching the next instruction from memory and
advances the pipeline. Hence, a complete branch of the LISA DAG and the related operations
represent an instruction in the modeled target machine. Figure 5.2 gives an example.
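The relation between the operation DAG and the instructions it represents can be illustrated with a minimal Python sketch. The activation table below is an assumption loosely based on the operation names of the example (the exact edge set of the original figure is not reproduced here), and the helper name instructions is hypothetical. Each operation lists its activation slots: a slot with one name is an unconditional activation, a slot with several names models a GROUP of alternatives.

```python
from itertools import product

# Hypothetical activation table (an assumption, not taken from a real model).
# One tuple per activation slot; several names in a slot model a GROUP.
OPS = {
    "main": [("fetch",)],
    "fetch": [("decode",)],
    "decode": [("arithmetic", "control")],
    "arithmetic": [("ADD", "SUB"), ("writeback",)],
    "control": [("CALL", "JUMP")],
    "ADD": [], "SUB": [], "CALL": [], "JUMP": [], "writeback": [],
}


def instructions(op):
    """List every operation set reachable from `op` when exactly one
    alternative is chosen per activation slot."""
    per_slot = []
    for slot in OPS[op]:
        choices = []
        for alternative in slot:
            choices.extend(instructions(alternative))
        per_slot.append(choices)
    result = []
    for combo in product(*per_slot):
        selected = [op]
        for part in combo:
            selected.extend(part)
        result.append(selected)
    return result


for ops in instructions("main"):
    print(", ".join(ops))
```

Under these assumptions the sketch enumerates four operation sets, one per modeled instruction (ADD, SUB, CALL, JUMP).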
OPERATION arithmetic IN pipe.DE {
  DECLARE {
    GROUP opcode = { ADD || SUB || ... };
    INSTANCE rs1, rs2, rd = { reg };
    INSTANCE writeback;
  }
  CODING { opcode rd rs1 rs2 0b00 }
  SYNTAX { opcode " " rd " " rs1 " " rs2 }
  BEHAVIOR {
    PIPELINE_REGISTER(pipe, DE/EX).src1 = GPR[rs1];
    PIPELINE_REGISTER(pipe, DE/EX).src2 = GPR[rs2];
  }
  ACTIVATION { opcode, writeback; }
}
OPERATION ADD IN pipe.EX {
  CODING { 0b00 }
  SYNTAX { "ADD" }
  BEHAVIOR {
    int op1 = PIPELINE_REGISTER(pipe, DE/EX).src1;
    int op2 = PIPELINE_REGISTER(pipe, DE/EX).src2;
    PIPELINE_REGISTER(pipe, EX/WB).dst = op1 + op2;
  } ...
}
OPERATION SUB IN pipe.EX {
  CODING { 0b01 }
  SYNTAX { "SUB" }
  BEHAVIOR {
    int op1 = PIPELINE_REGISTER(pipe, DE/EX).src1;
    int op2 = PIPELINE_REGISTER(pipe, DE/EX).src2;
    PIPELINE_REGISTER(pipe, EX/WB).dst = op1 - op2;
  } ...
}
OPERATION writeback IN pipe.WB {
  DECLARE { REFERENCE dst; }
  BEHAVIOR {
    GPR[dst] = PIPELINE_REGISTER(pipe, EX/WB).dst;
  }
}
Listing 5.2: LISA operation hierarchy example
The delay (in cycles) between two connected operations depends on the abstraction level. In case
of instruction-accurate models, operations are simply activated along the increasing depth of the
LISA operation DAG, whereas in case of cycle-accurate models it is delayed until the activation
advances to the stage related to the activated operation.
Listing 5.2 provides the specification for four of the operations in the example LISA operation DAG,
namely arithmetic, ADD, SUB, and writeback. The operation arithmetic is assigned to pipeline stage
DE, ADD and SUB to EX, and writeback to WB. Because ADD and SUB use the same type of operands (i.e. reg), the
initialization of the operands can be factored out and is thus modeled in the operation arithmetic.
This relationship is expressed through the definition of GROUPs, whose members form a list
of alternative, mutually exclusive operations. The group name can then be referenced within the
LISA sections, e.g. in the ACTIVATION section as depicted in the example. Here, all operations
potentially referenced by opcode are located in pipeline stage EX, i.e. their execution is delayed
until the subsequent cycle. The writeback operation is located in stage WB and is consequently
delayed by two cycles.
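The activation delays discussed above can be reproduced with a small Python sketch. It assumes the four-stage pipeline FE, DE, EX, WB from the example, no stalls or flushes, and only forward activations; the function name activation_delay is hypothetical, and returning zero for instruction-accurate models is a simplification.

```python
# Stage order as declared from left to right in the example pipeline.
PIPELINE = ["FE", "DE", "EX", "WB"]


def activation_delay(activating_stage, activated_stage, cycle_accurate=True):
    """Cycles between an activation and the execution of the activated
    operation (simplified: no stalls, forward activations only)."""
    if not cycle_accurate:
        # Simplification: instruction-accurate models carry no stage delay.
        return 0
    delay = PIPELINE.index(activated_stage) - PIPELINE.index(activating_stage)
    if delay < 0:
        raise ValueError("this sketch only models forward activations")
    return delay


print(activation_delay("DE", "EX"))  # arithmetic activating ADD or SUB
print(activation_delay("DE", "WB"))  # arithmetic activating writeback
```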
The SYNTAX section describes the assembly syntax of the instruction. The syntax elements can be either
terminal character sequences like “ADD” or nonterminals. The latter can correspond to a single
INSTANCE of a LISA operation or a GROUP. The CODING section specifies the binary coding
in a similar way using “0” and “1” as terminal elements. The behavior of a LISA operation is
executed only if all terminal sequences and nonterminals (more specifically, single instances and
at least one group member) match the actual decoded instruction.
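How a CODING section restricts which operation matches a given instruction word can be sketched in Python for the coding of arithmetic in Listing 5.2 (opcode rd rs1 rs2 0b00). The field widths are assumptions: a 7-bit register index (128 registers) and the 2-bit opcode codings of ADD and SUB, with the leftmost coding element placed in the most significant bits. The register syntax R<index> is likewise hypothetical, since the reg operation is not shown.

```python
# Assumed layout, MSB to LSB: opcode(2) rd(7) rs1(7) rs2(7) 0b00
REG_MASK = 0x7F                      # assumption: 7-bit register index
OPCODES = {0b00: "ADD", 0b01: "SUB"}  # codings of the GROUP members


def decode_arithmetic(word):
    """Return the assembly text if `word` matches the coding, else None."""
    if word & 0b11 != 0b00:          # trailing terminal 0b00 must match
        return None
    rs2 = (word >> 2) & REG_MASK
    rs1 = (word >> 9) & REG_MASK
    rd = (word >> 16) & REG_MASK
    name = OPCODES.get((word >> 23) & 0b11)
    if name is None:                 # no group member matches the opcode
        return None
    return f"{name} R{rd} R{rs1} R{rs2}"


word = (0b01 << 23) | (3 << 16) | (1 << 9) | (2 << 2)
print(decode_arithmetic(word))
```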
The BEHAVIOR section implements the combinatorial logic of the processor. The LISA language
allows arbitrary C/C++ descriptions of instruction behaviors, which achieves highest modeling
flexibility. As mentioned above, if a pipeline is modeled, the C/C++ instruction behavior descrip-
tion is typically distributed over different pipeline stages. In the example, arithmetic reads the
operands from the register file, stores them in the corresponding pipeline registers and activates
the operation currently referenced by opcode, so either ADD or SUB. These operations are executed
in the following cycle. Accordingly, they combine the operand pipeline registers and store the
result back into a pipeline register. Another cycle later the operation writeback writes the result
back to the register file. For that purpose, the dst instance declared in arithmetic has to be
referenced.
Apart from operation names, local variables can be declared and used in the BEHAVIOR section.
Global processor resources and pipeline registers can be accessed as well. It is even possible
to call external C/C++ functions or an internal LISA operation within the BEHAVIOR section
(behavioral call).
5.2 Compiler Designer
The Processor Designer employs the CoSy system from ACE [30] for compiler generation. CoSy
is a modular compiler generation system that offers numerous configuration possibilities both
at the level of the Intermediate Representation (IR) and the backend for machine code genera-
tion. As illustrated in Figure 5.3, CoSy is built around the CoSy Common Medium Intermediate
Representation (CCMIR) of the source program.
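[Figure 5.3 content: engines (front end, optimizer, code selector, scheduler, register allocator, emitter) grouped around the CCMIR and coordinated by the supervisor; source code as input, target code as output.]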
Figure 5.3: CoSy compiler development platform
In general, a compiler is built by specifying a set of analyses and transformations, called engines,
that annotate and modify the CCMIR. CoSy already comes with a broad range of standard
optimizations [1], but can also be easily extended with user-defined engines due to its modular concept.
Each engine must specify exactly which elements of the IR it accesses using the Full Structured
Definition Language (fSDL) [100]. The engines’ execution order is provided in a dedicated
specification using the Engine Description Language (EDL). From these pieces of information, a
so-called supervisor is generated which schedules the engines and grants access to the CCMIR.
The Backend Generator (BEG) is the most important component of the CoSy system. It takes so
called Code Generator Description (CGD) files as input and generates most of the backend source
code automatically. A CGD model consists mainly of three components:
• a specification of available target processor resources like registers or functional units
• a description of mapping rules (cf. Section 3.3.2), specifying how C/C++ language constructs
map to (potentially blocks of) assembly instructions
• a scheduler table that captures instruction latencies as well as instruction resource occupation
on a cycle-by-cycle basis
Apart from that, CoSy requires some more information like function calling conventions or the C
data type sizes and memory alignment. A more detailed description of CoSy can be found in [28].
As depicted in Figure 5.4, the Compiler Designer [149] basically extracts compiler-relevant infor-
mation from a given LISA processor model and translates it to a corresponding CGD description.
[Figure 5.4 content: LISA 2.0 description (registers, data and program memory, pipeline control, program sequencer); Processor Designer generating assembler, linker, and instruction-set simulator (ISS); Compiler Designer producing a .CGD file; CoSy generating the C compiler (code selector, scheduler, code emitter); iteration until the design goals are met.]
Figure 5.4: Tool flow for retargetable compilation
Afterwards, CoSy can be invoked as a “backend” to generate the compiler executable. However,
this translation is quite challenging for a number of reasons: While some information is explicit
in the LISA model (e.g. via resource declarations), other relevant information (e.g. concerning
instruction scheduling) is only implicit and needs to be extracted by dedicated algorithms. Some
further, heavily compiler-specific information is not present in the LISA model at all, e.g. C type
bit widths. Additionally, compiler retargeting is complicated by the semantic gap (cf.
Section 6.1) between the compiler’s high-level model of the target machine and the detailed ADL
model, which in particular must capture the cycle- and bit-true behavior of machine operations.
The Compiler Designer employs a semi-automatic approach for compiler generation. Compiler
information is automatically extracted from LISA whenever possible, while GUI-based user in-
teraction is employed for other compiler components. The Compiler Designer is organized in
different configuration dialogs and the user is guided step by step through the specification of
the missing items which could not be configured automatically or for further refinement of the
generated items.
Data Layout, Register Allocator and Calling Conventions: Purely numerical parameters
not present in the LISA model can be directly entered via GUI tables. This concerns mainly
compiler-dependent data like C type bit widths, type alignments, minimum addressable
memory unit size etc.
Configuration options for the register allocator include the selection of allocatable registers
out of the set of all available registers in the LISA model. For instance, registers selected
as frame or stack pointer need to be excluded from allocation. Another option regards
those registers which cannot be temporarily saved in memory. Finally, some processor
architectures allow the combination of several regular data registers to “long” registers of
larger bit width. The composition of long registers is also performed via the GUI.
The calling conventions basically describe the preferred passing of function parameters and
return values. The GUI provides a convenient dialog to specify for each C data type the
preferred passing method which can be either registers or stack.
Instruction Scheduler: Instruction schedulers determine the sequence in which instructions are
issued on the target processor. Besides structural hazards, data dependencies between in-
structions need to be taken into account (cf. Section 3.3.4). These constraints are captured
by scheduler tables containing latency information for the different kinds of dependencies
and the resource usage of instructions. These tables are generated fully automatically from
the given LISA model [172]. Since the generator guarantees a correct (yet sometimes too
conservative) scheduler, it is possible to manually override the extracted scheduler charac-
teristics in the GUI. From this information an improved backtracking scheduler is finally
generated.
Code Selector: In order to get an operational compiler, a minimum set of code selector rules or
mapping rules is needed. These mapping rules are the basis for the tree pattern matching
based code selector (cf. Section 3.3.2) in CoSy. The Compiler Designer comprises a so-called
mapping dialog (Figure 5.5). This dialog provides the set of available IR operations (top left
in Figure 5.5), defined nonterminals (bottom left) as well as the hierarchically organized set
of machine operations in the given LISA model (right). By means of a convenient drag-
and-drop mechanism, the user can manually compose mapping rules (top center) from the
given IR operations (1) and nonterminals (2). Likewise, the link between mapping rules and
their arguments on the one hand and machine operations and their operands on the other
hand is made via drag-and-drop in the mapping dialog (3). In this way, multi-instruction
rules which can even contain control flow as well as complex instructions like MAC can be
composed. The example from Figure 5.5 shows the mapping defined for a 32-bit multiply
operation, which is implemented by a sequence of two 16-bit multiply instructions and an
add instruction. Based on this manually established mapping, the Compiler Designer looks
up the required assembly syntax of involved instructions (4) in the LISA model and can
therefore automatically generate the code emitter for the respective mapping rule. The
output of the code emitter is symbolic assembly code, which will be further processed by
the register allocator and the instruction scheduler during code generation.
The mapping dialog also provides additional capabilities, e.g. for capturing rule attributes
like type-dependent conditions for rule matching or for reserving scratch registers for use in
complex multi-instruction rules, such as the above 32-bit multiply example.
The Compiler Designer supports a generic stack organization, which assumes the architec-
ture provides stack and frame pointer registers as well as register-offset addressing. Corre-
sponding to this generic stack model, the user has to assign instructions to some predefined
mapping rules needed for function prologue and epilogue code generation.
Providing the minimum set of mapping rules enables the generation of a working compiler
suitable for early architecture exploration. Naturally, at any time, the user may refine the
code selector by adding more dedicated mapping rules that efficiently cover special cases,
leading to higher code quality.
[Figure 5.5 content: mapping dialog with IR operations (1), nonterminals (2), LISA operations (3), and the assembly syntax of the mapped instructions (4); the example rule maps a mul of two reg operands to a reg result.]
Figure 5.5: Mapping dialog
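The role of the latency information in the scheduler tables mentioned under Instruction Scheduler can be illustrated with a minimal Python sketch. The table contents, instruction classes, and the function name earliest_issue are hypothetical; the sketch assumes an in-order, single-issue machine, considers only true (read-after-write) dependencies, and is not the scheduler generated by the Compiler Designer.

```python
# Hypothetical latency table: (producer class, consumer class) -> minimum
# distance in cycles for a true dependency. Values are illustrative only.
LATENCY = {("mul", "add"): 2, ("add", "add"): 1, ("load", "add"): 3}


def earliest_issue(instrs):
    """Compute issue cycles for a straight-line sequence.

    instrs: list of (op_class, dst_register, [src_registers]).
    """
    issue = []
    last_writer = {}  # register name -> index of the producing instruction
    for index, (op, dst, srcs) in enumerate(instrs):
        cycle = issue[-1] + 1 if issue else 0  # single issue, in order
        for src in srcs:
            if src in last_writer:
                producer = last_writer[src]
                latency = LATENCY.get((instrs[producer][0], op), 1)
                cycle = max(cycle, issue[producer] + latency)
        issue.append(cycle)
        last_writer[dst] = index
    return issue


program = [
    ("mul", "r1", ["r2", "r3"]),
    ("add", "r4", ["r1", "r5"]),
    ("add", "r6", ["r4", "r1"]),
]
print(earliest_issue(program))
```

With the assumed latencies, the first add has to wait one extra cycle for the multiplication result.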
The final output of the Compiler Designer is a compiler specification file in CoSy’s CGD format,
from which in turn a C/C++ compiler is generated fully automatically. During compiler retarget-
ing, the session status of the Compiler Designer can be saved in XML format and can be resumed
at any time.
5.3 Synopsis
• The Processor Designer environment supports all ASIP design phases.
• All software development tools, except the compiler, can be generated fully automatically.
• Some C compiler components are extracted from the LISA model (e.g. scheduler tables), while
the largest part (the code selector description) still needs to be retargeted manually.
Chapter 6
Code Selector Description Generation
In Section 3.3.2 it was mentioned that the code selector’s task is to map the IR to a semantically
equivalent sequence of machine instructions. A common technique for code selection is the tree
pattern matching technique, which is also employed in the CoSy platform. As with many other
ADLs, the required tree grammar must be specified manually in the Compiler Designer. Practical
experience showed that this is a time-consuming, tedious, and error-prone task. Additionally, two
major drawbacks have been identified: First of all, the designer actually starts with an empty code
selector specification, i.e. the designer must know which code selector rules are necessary
to build a working compiler that is able to translate arbitrary input programs. Secondly, the code selector
description from a previous architecture exploration phase may be inconsistent after a change in
the underlying ADL model (e.g. a rearrangement of the instruction-set hierarchy). In this case,
the code selector specification must be revised entirely. Unfortunately, major changes to the ADL
model are quite common in the early exploration phase when different architectural alternatives
are evaluated. This is further aggravated by the fact that the user is responsible for maintaining
the correctness of the mapping rules, since pure changes in the instruction behavior description,
without changing the hierarchy or the assembly coding, are not detected automatically. Hence,
this chapter presents a novel methodology to generate the code selector description automatically
from LISA processor models (Figure 6.1) which completely eliminates these problems.
[Figure 6.1 content: Code Selector Description Generator; LISA 2.0 description with instruction semantics as input; code selector description for the Compiler Designer as output.]
Figure 6.1: Code selector description generation
The rest of this chapter is organized as follows: Section 6.1 elaborates on the difficulties of extracting
code selector relevant information from a given LISA model. The extensions to the LISA model
required to circumvent them are presented in Section 6.2. Afterwards, Section 6.3 describes how
this information is used to enable the automatic generation of the code selector rules. Finally,
Section 6.4 describes the integration into the Compiler Designer.
6.1 The Semantic Gap
When the LISA language was initially developed, the primary goal was to generate fast processor
simulators [248]. Subsequently, the language was further refined and extended to be able
to describe a broader range of architectures as well as to enable the generation of the remaining
software development tools. Consequently, a LISA description has a rather simulator-centric view,
i.e. the main focus in its design was to capture the cycle- and bit-true behavior of machine operations.
As a result, the LISA language allows arbitrary C/C++ descriptions of instruction semantics.
This feature ensures the highest flexibility in describing how an instruction performs its task, but results in a
quite detailed ADL model. Compiler generation, however, rather requires the information what an
instruction does, which is quite difficult to extract from such “informal” models of instructions.
This semantic gap complicates code selector rule generation in particular. Consider the LISA
operation example shown in Listing 6.1. It describes an addition instruction which sets the carry
flag according to the result. Note that this operation (like all remaining operations in this chapter)
has no pipeline stage assigned and hence belongs to an instruction-accurate LISA model.
OPERATION ADD {
  DECLARE {
    GROUP src1, dst = { reg };
    GROUP src2 = { reg || imm };
  }
  SYNTAX { "ADD" dst "=" src1 "," src2 }
  CODING { 0b0000 src1 src2 dst }
  BEHAVIOR {
    dst = src1 + src2;
    if ( ((src1 < 0) && (src2 < 0))
      || ((src1 > 0) && (src2 > 0) && (dst < 0))
      || ((src1 > 0) && (src2 < 0) && (src1 > -src2))
      || ((src1 < 0) && (src2 > 0) && (-src1 < src2)))
    { carry = 1; }
  }
}
Listing 6.1: LISA operation for an ADD instruction
Even for this relatively simple operation, it is practically impossible to extract the high-level
semantic meaning of the instruction accurately and automatically from the BEHAVIOR section. First of all, the
presented code is only one of many ways to describe the carry flag computation, owing to the
numerous syntactic variants available in C/C++. This is further aggravated by the fact that once a pipeline is modeled,
this C/C++ instruction behavior description will be distributed over different pipeline stages
(cf. Section 5.1). Besides, the example does not model any architectural features such as register
bypassing or side effects, which would lead to a much more complex description than the one
shown in the example.
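The point about syntactic variance can be made concrete with a short Python sketch. It shows two differently written computations of the carry-out of an unsigned 32-bit addition that yield the same result; both function names are hypothetical, and the unsigned formulation is an illustration only, not the signed condition used in Listing 6.1. A generator would have to recognize both forms, and many others, as the same carry semantics.

```python
MASK32 = 0xFFFFFFFF


def carry_from_wide_sum(a, b):
    """Carry-out taken from bit 32 of the full-precision sum."""
    return ((a + b) >> 32) & 1


def carry_from_wraparound(a, b):
    """Carry-out detected by comparing the wrapped sum with an operand."""
    return 1 if ((a + b) & MASK32) < a else 0


# Both variants agree for unsigned 32-bit inputs.
samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
print(all(carry_from_wide_sum(a, b) == carry_from_wraparound(a, b)
          for a in samples for b in samples))
```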
Thus, in order to close the semantic gap, a new SEMANTICS section is introduced to LISA [112]. It
captures the instruction behavior at a higher abstraction level while ignoring all structural details
like pipelining for instance. This enables a clean and unambiguous way of describing instruction
semantics which in particular is suitable to generate code selector rules.
6.2 SEMANTICS Section
The requirements for a semantic operation description are as follows:
• Uniqueness, simplicity, and flexibility.
• A single, concise formalism to define the semantics, though still flexible enough to describe
even complex operations. Considering that the SEMANTICS sections and BEHAVIOR sec-
tions describe both the behavior of instructions, a concise description reduces redundancy
to a minimum.
• Legacy LISA models should be easily extendable to aid the compiler generation with only
minor additional design effort.
• For the purpose of compiler generation, ambiguity has to be strictly avoided.
The MIMOLA ADL [211] employs a set of so-called micro-operations to describe a processor’s
instruction set. Each micro-operation can be seen as a primitive operation similar to the
instructions of a RISC instruction-set architecture. Complex instructions can typically be modeled by
combining several micro-operations. This approach has been proven feasible and complete for the specification
of instruction semantics, but it is unsuitable for the description of complex micro-architectural
behavior as required for cycle-accurate simulators or HDL generation. Fortunately, this is already
covered by the BEHAVIOR section. Thus, the micro-operation idea is adapted for the definition
of the SEMANTICS section, since it meets the requirements for the description of instruction
semantics very well.
6.2.1 Micro-operations
The examination of the instruction-set architectures of several contemporary embedded processors
revealed that the high-level behavior of most instructions is typically either an arithmetic
calculation using several operands or a control-flow operation. The calculations carried out by the
instructions can be further decomposed into one or several primitive operations, and the set
of primitive operations is quite limited. However, to meet the aforementioned requirements, the
operations to be included in the set of micro-operators must be carefully selected. For
instance, only those operators that are relevant for code selector generation are of importance.
It does not make sense to provide dedicated micro-operations for e.g. saturated arithmetic as
supported by many DSP architectures, since the C language does not support saturated
arithmetic at all. At the same time, it should still be possible to describe those operations with
existing micro-operators. A comprehensive list of all available micro-operators can be found in
Appendix A.
The example in Listing 6.2 shows the ADD operation from the previous example using the
SEMANTICS section instead of the BEHAVIOR section.
OPERATION ADD {
  DECLARE {
    GROUP src1, dst = { reg };
    GROUP src2 = { reg || imm };
  }
  ...
  SEMANTICS {
    _ADD|_C|(src1, src2)<0,32> -> dst;
  }
}
Listing 6.2: Operation with semantics
OPERATION reg {
  DECLARE {
    LABEL index;
  }
  ...
  SEMANTICS {
    _REGI(R[index])<0..31>;
  }
}
Listing 6.3: Operand’s semantics
A micro-operation is a tuple (o, S, U, v, w), consisting of the micro-operator o, the set of side effects
S ⊆ {C, V, N, Z}, the set of operands U, and a bit-field specification represented by bit offset v
and bit width w. In the given example, the micro-operator _ADD defines the integer addition,
while the following _C specifies that the carry flag is affected by the operation. Other supported
flags are zero (_Z), negative (_N), and overflow (_V). A comma-separated list of operands, i.e. src1
and src2, follows in parentheses. The <0,32> after the closing parenthesis of _ADD explicitly specifies that
the result of the addition is 32 bits wide (bit 0 is the first bit). Hence, the corresponding tuple for
the example is (_ADD, {C}, {src1, src2}, 0, 32). Finally, to build a complete semantic statement, the arrow ->
specifies the location for the result. Compared with the BEHAVIOR section shown in Listing 6.1,
the description in the SEMANTICS section is much simpler.
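The tuple view of a micro-operation can be made tangible with a small Python sketch that parses a flat semantic statement into (o, S, U, v, w) plus the result location. The class MicroOp, the regular expression, and parse_statement are hypothetical helpers for illustration; the sketch handles only flat statements without chaining and is not the parser used by the LISA tools.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class MicroOp:
    operator: str                 # o
    side_effects: FrozenSet[str]  # S, subset of {C, V, N, Z}
    operands: Tuple[str, ...]     # U (ordered here for convenience)
    offset: Optional[int]         # v, None if not given explicitly
    width: Optional[int]          # w, None if not given explicitly


STATEMENT = re.compile(
    r"\s*(_\w+)"             # micro-operator, e.g. _ADD
    r"((?:\|_[CVNZ])*)\|?"   # optional side effects, e.g. |_C|
    r"\(([^()]*)\)"          # flat operand list
    r"(?:<(\d+),(\d+)>)?"    # optional bit-field <offset,width>
    r"\s*->\s*(\w+)\s*;\s*$"  # result location
)


def parse_statement(text):
    """Return (MicroOp, sink) for a flat statement such as Listing 6.2."""
    match = STATEMENT.match(text)
    if match is None:
        raise ValueError(f"unsupported statement: {text!r}")
    operator, flags, operands, offset, width, sink = match.groups()
    micro_op = MicroOp(
        operator,
        frozenset(flag for flag in flags.split("|_") if flag),
        tuple(name.strip() for name in operands.split(",") if name.strip()),
        int(offset) if offset is not None else None,
        int(width) if width is not None else None,
    )
    return micro_op, sink


print(parse_statement("_ADD|_C|(src1, src2)<0,32> -> dst;"))
```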
The operands of the micro-operator can be either terminal elements, such as integer constants, or
other LISA operations like in the example. In the latter case, the respective operations must have
a SEMANTICS section on their own. In Listing 6.3, the SEMANTICS section of the reg operation
defines the semantic type of the operand. In this case it is a 32-bit integer register specified as
array R in the global RESOURCE section (not shown here).
Similar to a micro-operation, each of its operands can be represented as a 3-tuple
(u, v, w) consisting of the value/resource u and a bit-field specification represented by bit offset
v and bit width w. Thus, the corresponding tuple for operation reg is (u, v, w) = (R[index], 0, 32).
If no explicit bit-field specification is provided for the micro-operator, the specification is
deduced from the input operands. For instance, the addition of two operands (a, 0, 32) and (b, 0, 32)
results in the 3-tuple (c, 0, 32), where c is the result of the 32-bit addition of a and b. Thus, the
explicit bit-field specification <0,32> for _ADD in the example is actually superfluous.
Note that the bit-width specification is compulsory for those micro-operations whose output
bit width cannot be deduced from their operands, such as sign/zero extension.
Furthermore, certain micro-operators impose implicit restrictions on the bit widths of their
input operands. An implicit constraint for the _ADD micro-operator is that both operands
share the same bit width. If that constraint is not met, the respective operand has to be extended
to match the width of the other operand by means of an explicit sign/zero extension. Two
separate micro-operations, _SXT and _ZXT, serve that purpose.
The generic micro-operation and operand representation allows for a very compact instruction-set
description while keeping the number of required micro-operations small.
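The bit-field deduction and the width constraint of _ADD can be sketched in Python as follows. The names Operand, micro_add, and micro_zxt are hypothetical, and the result value is represented symbolically as a string; the sketch mirrors only the rules stated above (equal operand widths for _ADD, explicit width for zero extension).

```python
from typing import NamedTuple


class Operand(NamedTuple):
    value: str   # u: symbolic value or resource
    offset: int  # v: bit offset
    width: int   # w: bit width


def micro_add(a, b):
    """Deduce the result bit-field of _ADD from its operands."""
    if a.width != b.width:
        raise ValueError("_ADD operands must share the same bit width")
    return Operand(f"_ADD({a.value}, {b.value})", 0, a.width)


def micro_zxt(a, width):
    """Zero extension: the output width must be given explicitly."""
    if width < a.width:
        raise ValueError("extension cannot narrow an operand")
    return Operand(f"_ZXT({a.value})", 0, width)


half = Operand("h", 0, 16)
word = Operand("b", 0, 32)
print(micro_add(micro_zxt(half, 32), word))
```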
6.2.2 Modeling Complex Operations
Obviously, not all instructions can be expressed by a single micro-operation. For instance, many
DSP processor architectures have instructions performing a combined computation such as Multiply and
Accumulate (MAC). Such behavior is captured in SEMANTICS sections by using a
micro-operation as the operand of another micro-operation, henceforth referred to as chaining.
OPERATION MAC {
  DECLARE {
    GROUP src1, src2, dst = { reg };
  }
  ...
  SEMANTICS {
    _ADD(_MULUU(src1, src2)<0,32>,
         dst) -> dst;
  }
}
Listing 6.4: Micro-operation chaining
OPERATION SWAP {
  DECLARE {
    GROUP src = { reg };
  }
  ...
  SEMANTICS {
    src<0,16> -> src<16,16>;
    src<16,16> -> src<0,16>;
  }
}
Listing 6.5: Multiple statements
A simple example of a MAC operation can be found in Listing 6.4. _MULUU is the micro-operator
that denotes the unsigned multiplication. Its result is used as one of the operands of _ADD,
thus building a micro-operation chain. The bit-field specification in angle brackets is required to
ensure that both operands of _ADD have matching bit widths.
The chaining mechanism helps to describe complex operations without introducing temporary
variables. This guarantees a tree-like structure for each semantic statement. Such trees are
well suited for mapping rule generation since most code selection algorithms are based on the
tree pattern matching technique.
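The tree structure that results from chaining can be sketched in Python by writing the MAC statement of Listing 6.4 as a nested tuple and evaluating it recursively. The tuple encoding and the function evaluate are hypothetical; interpreting the <0,32> bit-field as truncation to the lower 32 bits is an assumption of this sketch.

```python
MASK32 = 0xFFFFFFFF

# _ADD(_MULUU(src1, src2)<0,32>, dst) written as a nested tuple (tree).
MAC = ("_ADD", ("_MULUU", "src1", "src2"), "dst")


def evaluate(node, regs):
    """Evaluate a micro-operation tree bottom-up on a register state."""
    if isinstance(node, str):        # leaf: operand name
        return regs[node]
    operator, *operands = node
    values = [evaluate(operand, regs) for operand in operands]
    if operator == "_ADD":
        return (values[0] + values[1]) & MASK32
    if operator == "_MULUU":
        return (values[0] * values[1]) & MASK32  # assumed <0,32> truncation
    raise ValueError(f"unsupported micro-operator: {operator}")


regs = {"src1": 3, "src2": 4, "dst": 10}
regs["dst"] = evaluate(MAC, regs)
print(regs["dst"])  # 22
```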
In general, most of the RISC instructions can be modeled with one statement (including
chaining), but obviously this is not sufficient for those instructions transferring data to multiple
destinations. However, this can be modeled with multiple statements in the SEMANTICS sections.
It is assumed that all statements execute in parallel. Consequently, a preceding statement’s result
cannot be used as the input of the following statement. Listing 6.5 illustrates this. The SWAP
operation swaps the content of a register by exchanging the upper and lower 16 bits. Because
the execution is in parallel, the data in the register are exchanged safely without considering
sequential overriding.
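The difference between parallel and sequential execution of the two SWAP statements can be checked with a short Python sketch. The helper names and the encoding of a statement as a pair of (offset, width) bit-fields are hypothetical; the parallel variant reads all sources from the old register value before any result is written.

```python
def bits(value, offset, width):
    """Extract the bit-field <offset,width> from value."""
    return (value >> offset) & ((1 << width) - 1)


def insert(value, offset, width, field):
    """Return value with the bit-field <offset,width> replaced by field."""
    mask = ((1 << width) - 1) << offset
    return (value & ~mask) | ((field << offset) & mask)


# Each statement: (source bit-field, destination bit-field) of one register.
SWAP = [((0, 16), (16, 16)), ((16, 16), (0, 16))]


def execute_parallel(reg, statements):
    reads = [bits(reg, off, w) for (off, w), _ in statements]  # old state
    for field, (_, (off, w)) in zip(reads, statements):
        reg = insert(reg, off, w, field)
    return reg


def execute_sequential(reg, statements):
    for (s_off, s_w), (d_off, d_w) in statements:
        reg = insert(reg, d_off, d_w, bits(reg, s_off, s_w))
    return reg


print(hex(execute_parallel(0x12345678, SWAP)))    # 0x56781234
print(hex(execute_sequential(0x12345678, SWAP)))  # 0x56785678
```

Only the parallel interpretation exchanges the two halves; the sequential one overwrites the upper half before it is read.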
Another important kind of behavior in modern processors is predicated execution, i.e. an
instruction is executed depending on certain conditions. In order to model such instructions,
IF-ELSE statements and comparison operators can be used in the SEMANTICS sections to model
all kinds of conditions. Of course, comparisons can be chained, too. Listing 6.6 gives an example
of an addition that depends on the carry flag. The _EQ operator checks whether the two input operands, an
integer constant and the carry flag, are equal or not. Depending on the result, the IF statement
will execute the code specified in the braces.
OPERATION CADD {
  DECLARE {
    GROUP src1, src2, dst = { reg };
  }
  ...
  SEMANTICS {
    IF (_EQ(_CF, 1)) {
      _ADD(src1, src2) -> dst; }
  }
}
Listing 6.6: IF-ELSE statement
OPERATION DCT2d {
  DECLARE {
    GROUP src, dst = { reg };
  }
  ...
  SEMANTICS {
    "_DCT2d"(src) -> dst;
  }
}
Listing 6.7: Intrinsic micro-operation
Naturally, it is not possible to describe every instruction with the formalism presented above. For
instance, ASIPs often feature application-specific instructions whose behavioral description can
vary from only a few lines of code to several hundred. Obviously, such complex behavior can hardly
be expressed with micro-operations, if at all. This is actually no drawback, since such
instructions cannot be directly exploited by today’s code selection techniques anyway. For such
instructions, a special intrinsic micro-operation can be used as a kind of wildcard. No semantic
meaning is associated with its description, just a user-defined name. Listing 6.7 illustrates this.
With the capability of defining intrinsics, every instruction can be described in the SEMANTICS
sections. Intrinsic micro-operators are treated separately during mapping rule generation.
6.2.3 Semantics Hierarchy
Section 5.1 already illustrated the LISA operation hierarchy, which provides modeling flexibility
and simplicity. Consequently, it has to be supported by the semantic description as well. Listing
6.8 provides an example. In the arithm operation, the GROUP opcode is used as the micro-operator.
The concrete micro-operator is therefore obtained from the SEMANTICS sections of the
respective GROUP members. In this case, the SEMANTICS sections of the ADD and SUB operations
provide the corresponding micro-operator. The similarity of the semantics of ADD and SUB
is exploited here to simplify the description.
OPERATION arithm {
  DECLARE {
    GROUP src1, src2, dst = { reg };
    GROUP opcode = { ADD || SUB ... };
  }
  ...
  SEMANTICS { opcode|_C|(src1, src2)
              -> dst; }
}
OPERATION ADD {
  ...
  SEMANTICS { _ADD; }
}
OPERATION SUB {
  ...
  SEMANTICS { _SUB; }
}
Listing 6.8: Hierarchical operators
OPERATION ADD {
  DECLARE {
    GROUP src1, dst = { reg };
    GROUP opd = { SHL || SHR };
  }
  ...
  SEMANTICS { _ADD(src1, opd) -> dst; }
}
OPERATION SHL {
  DECLARE {
    GROUP src2 = { reg };
    GROUP imme = { imm };
  }
  ...
  SEMANTICS { _LSL(src2, imme); }
}
OPERATION SHR {
  ...
  SEMANTICS { _LSR(src2, imme); }
}
Listing 6.9: Hierarchical operands
A SEMANTICS section can return not only a micro-operator but also a complete micro-operation
expression. In Listing 6.9, the SEMANTICS sections of the SHL and SHR operations do not contain
a complete statement with assignment but micro-operators with operands (_LSL and _LSR are
the logical left and right shift micro-operators). As a result, the semantics of these two operations are
not self-contained, because the data sink is missing. These two operations actually perform
operand pre-processing for the ADD operation, as can be seen in its SEMANTICS section.
The opd GROUP, which contains the previous two operations, is used as one of the operands of
the _ADD micro-operation. Thereby, depending on the binary encoding of the actual instruction,
one of the operand registers will be left or right shifted before the addition is actually performed.
The presented formalism that defines the SEMANTICS sections is very flexible and well integrated
into LISA. If the commonalities of instructions are fully exploited, their instruction semantics can
mostly be described with a single or a few semantic statements.
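How hierarchical semantics resolve into flat statements can be sketched in Python by substituting the semantics of each GROUP member into the statement of the parent operation. The function expand and the use of format placeholders for group names are hypothetical simplifications; the member semantics are taken from Listings 6.8 and 6.9.

```python
from itertools import product


def expand(template, groups, semantics):
    """Substitute every combination of group members into the template.

    template:  parent statement with {group} placeholders
    groups:    group name -> list of member operations
    semantics: member operation -> text of its SEMANTICS section
    """
    names = list(groups)
    statements = []
    for combo in product(*(groups[name] for name in names)):
        mapping = {name: semantics[member] for name, member in zip(names, combo)}
        statements.append(template.format(**mapping))
    return statements


# Listing 6.8: the group provides the micro-operator.
print(expand("{opcode}|_C|(src1, src2) -> dst;",
             {"opcode": ["ADD", "SUB"]},
             {"ADD": "_ADD", "SUB": "_SUB"}))

# Listing 6.9: the group provides a complete operand expression.
print(expand("_ADD(src1, {opd}) -> dst;",
             {"opd": ["SHL", "SHR"]},
             {"SHL": "_LSL(src2, imme)", "SHR": "_LSR(src2, imme)"}))
```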
6.3 Code Selector Description Generation
The code selector generator in CoSy uses the dynamic programming tree matching algorithm as
presented in Section 3.3.2. The tree grammar G = (T,N, P, S, w) consists of finite sets N and
T of nonterminal and terminal symbols, respectively, as well as a set P of mapping rules, the
corresponding cost metric w, and a start symbol S ∈ N . The terminals T essentially describe the
available IR operations of the given source language and thus are target machine independent.
Likewise, the start symbol S requires no special retargeting effort. Only the nonterminals N , and
the rules P , need to be adapted to the target machine. N basically reflects available instruction
operand kinds, e.g. registers, memories, and addressing modes like register-offset addressing for
instance, while P defines how source language operations are implemented by the target instruc-
tions. Each mapping rule in P has the form of a tree pattern that may serve to cover a data-flow
graph fragment during code selection.
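The interplay of terminals, nonterminals, rules, and costs can be illustrated with a minimal bottom-up labeling pass in Python. This is a simplified sketch of dynamic programming tree matching, not CoSy's implementation; the rule set, the costs, and the terminals mirReg and mirConst are hypothetical (only mirPlus and mirMult are named in this chapter).

```python
# Each rule: (result nonterminal, pattern, cost, name). A pattern is either
# an operator tuple with nonterminal children or, for chain rules, a bare
# nonterminal. All rules and costs are illustrative assumptions.
RULES = [
    ("reg_nt", ("mirReg",), 0, "reg"),
    ("imm_nt", ("mirConst",), 0, "imm"),
    ("reg_nt", "imm_nt", 1, "load_imm"),
    ("reg_nt", ("mirPlus", "reg_nt", "reg_nt"), 1, "add"),
    ("reg_nt", ("mirPlus", "reg_nt", "imm_nt"), 1, "addi"),
    ("reg_nt", ("mirMult", "reg_nt", "reg_nt"), 2, "mul"),
]


def label(tree):
    """Return {nonterminal: (minimum cost, rule name)} for the tree root."""
    op, *kids = tree
    kid_labels = [label(kid) for kid in kids]
    best = {}
    for nt, pattern, cost, name in RULES:
        if isinstance(pattern, str):
            continue                      # chain rules are handled below
        p_op, *p_kids = pattern
        if p_op != op or len(p_kids) != len(kids):
            continue
        if any(k not in lab for k, lab in zip(p_kids, kid_labels)):
            continue
        total = cost + sum(lab[k][0] for k, lab in zip(p_kids, kid_labels))
        if nt not in best or total < best[nt][0]:
            best[nt] = (total, name)
    changed = True
    while changed:                        # closure over chain rules
        changed = False
        for nt, pattern, cost, name in RULES:
            if isinstance(pattern, str) and pattern in best:
                total = cost + best[pattern][0]
                if nt not in best or total < best[nt][0]:
                    best[nt] = (total, name)
                    changed = True
    return best


tree = ("mirPlus", ("mirReg",), ("mirConst",))
print(label(tree)["reg_nt"])  # (1, 'addi')
```

With these assumed costs, covering the constant directly with the immediate form is cheaper than first loading it into a register.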
RULE o:mirPlus (a:reg_nt, b:reg_nt) -> c:reg_nt;
CONDITION { IS_INT(o) }
COST 1;
EMIT {
  Print("add %s = %s, %s", REGNAME(c),
                           REGNAME(a),
                           REGNAME(b));
}
[Figure 6.2 annotations: tree pattern with IR operator mirPlus, input operand nonterminals, and result nonterminal; IR/operand names; condition to apply the rule (integer addition?); cost metric; print function for code emission with the assembly syntax format string and the operand registers’ syntax names.]
Figure 6.2: CoSy mapping rule syntax
Figure 6.2 shows a typical CoSy mapping rule specification. Each rule starts with the keyword
RULE followed by the tree pattern specification. In CoSy, IR operators are named mirPlus,
mirMult, etc. Each IR operator and each operand can be associated with a name for further
reference. Each rule can have an optional CONDITION assigned that must be met before
the rule can be applied. In the example, the rule only matches integer additions (i.e.
floating point additions are not matched by the rule). In addition, each rule has a fixed cost
assigned that is used by the tree pattern matching algorithm. Finally, the EMIT part contains a
print function that is executed by the code emitter, the final compiler phase, if the rule has been
selected. Here, it prints the add syntax including the physical register names that have been
assigned to the nonterminals during register allocation.
The following sections describe how the nonterminals N and the mapping rules P and the associ-
ated conditions are automatically generated from the instruction semantics information given in
the SEMANTICS sections (Figure 6.3).
[Diagram: the SEMANTICS sections of the LISA 2.0 description (regs, data mem, prog mem, pipeline
control, prog seq, IF/ID, ID/EX/WB) feed the nonterminal generation and the mapping rule
generation (one-to-one, one-to-many, many-to-one, intrinsics) of the Compiler Designer, which
produce the code selector description.]
Figure 6.3: Nonterminal and mapping rule generation
6.3.1 Nonterminal Generation
In tree grammar descriptions, nonterminals can be seen as temporary variables connecting different
grammar rules. In this way, they determine the expressive power of a tree grammar specification
to a large degree. Usually, each nonterminal corresponds to some feature of the target architecture
that is common to a number of instructions, such as registers or memory accesses. Thus,
depending on the type of the temporary, nonterminals can be divided into the following four
categories:
Register nonterminals represent the registers which can be used by the compiler.
Immediate nonterminals carry the constant values that can be used as immediate operands
in instructions.
Addressing mode nonterminals encapsulate the addressing modes supported by the target,
e.g. register-offset addressing.
Condition nonterminals typically represent condition flag registers that are affected by different
instructions, e.g. the carry or zero flag.
In LISA processor models, accesses to these storage locations or processor resources are usually
described by a wrapper operation, like the operation reg in Listing 6.3. A set of micro-operators is
available which captures the semantics of these wrappers. As mentioned above, the _REGI operator
in the example stands for a register access. Its operand R is the name of the corresponding LISA
resource that is used as register bank. The index of the accessed register is given by index, a LISA
label whose value is determined by the instruction encoding. Another important piece of information,
the bit-width of the registers, is specified with the notation <0,32>, which means that the register
is 32 bits wide and the least significant bit is bit 0. From this specification a register nonterminal
with the given properties can be generated.
Likewise, the generation of the immediate and the addressing mode nonterminals is based on two
related micro-operators, namely _IMMI for immediates and _INDIR for memory references. The
latter is typically used in a micro-operator chain to describe more complex addressing modes.
In general, condition nonterminals represent the flag registers; their existence depends on the
use of the four predefined flags, namely carry (_C), zero (_Z), overflow (_O), and negative (_N)
flag. For example, the semantics statement in Listing 6.2 writes to the carry flag. Accordingly, a
condition nonterminal for the carry flag is generated.
These four kinds of nonterminals are the processor specific elements in the mapping rules. The
nonterminal generator checks all available LISA operations for those micro-operators and creates the
corresponding nonterminals if they do not exist yet. Afterwards, the algorithm proceeds
with the generation of the mapping rules.
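The scan described above can be sketched in Python as follows. This is only an illustration under several assumptions: the semantics statements are treated as plain strings, the bit-field is assumed to follow the register access directly, and the nonterminal names (reg_R_nt, imm_nt, addr_nt, flag_c_nt) are hypothetical rather than the names the actual generator produces.

```python
import re

# Register access with an optional bit-field <lsb,width>, e.g. _REGI(R[src1])<0,32>
REGI = re.compile(r"_REGI\((\w+)\[\w+\]\)(?:<(\d+),(\d+)>)?")
# Flag side effects attached to a micro-operator, e.g. _ADD|_C|
FLAG = re.compile(r"\|(_[CZON])\|")

def generate_nonterminals(statements):
    """Create one nonterminal per register bank, operand kind, and used flag."""
    nonterminals = {}
    for stmt in statements:
        for bank, _lsb, width in REGI.findall(stmt):
            # setdefault keeps an already existing nonterminal untouched.
            nonterminals.setdefault(
                f"reg_{bank}_nt", ("register", int(width) if width else None))
        if "_IMMI(" in stmt:
            nonterminals.setdefault("imm_nt", ("immediate", None))
        if "_INDIR(" in stmt:
            nonterminals.setdefault("addr_nt", ("addressing mode", None))
        for flag in FLAG.findall(stmt):
            nonterminals.setdefault(f"flag{flag.lower()}_nt", ("condition", 1))
    return nonterminals

# Two hypothetical semantics statements in the assumed textual form.
model = [
    "_ADD|_C|(_REGI(R[src1])<0,32>, _REGI(R[src2])<0,32>) -> _REGI(R[dst])<0,32>;",
    "_ADD(_REGI(R[src1])<0,32>, _IMMI(value)) -> _REGI(R[dst])<0,32>;",
]
nts = generate_nonterminals(model)
print(sorted(nts))
```

The use of setdefault mirrors the rule that a nonterminal is only created if it does not exist yet, so the 32-bit register bank R yields a single register nonterminal although it is accessed six times.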
6.3.2 Mapping Rule Generation
In general a mapping rule consists of three parts: a tree pattern, the result nonterminal produced
by the rule, and one or more associated machine instructions. The tree pattern represents a C-
level computation which can be performed by the processor. Likewise, the input operands of the
computations are usually also nonterminals. Thus, to generate mapping rules for a working code
selector description, mainly two questions need to be answered. The first one is
• which tree patterns are needed to cover the complete set of possible IR operations
and the second is,
• how the tree patterns are mapped to the target machine instruction-set.
Basic Rules
A complete code selector description must cover all IR tree patterns that the compiler frontend
may produce. Since the source language does not change, the IR tree patterns that need to be
covered by a code selector are actually fixed. Consequently, a set of mapping rule templates can
be prepared without knowing the target processor. The set of such templates is called basic rules
in the following. Listing 6.10 shows a basic rule, and Listing 6.11 the corresponding CoSy mapping
rule, for an addition of two registers whose result is again stored in a register.
COSYIR mirPlus(a,b) -> c;
PATTERN {
    _ADD(a,b) -> c;
}
Listing 6.10: Basic rule example
RULE o:mirPlus(a:reg_nt,b:reg_nt) -> c:reg_nt;
EMIT {
    print("add %s = %s, %s",
          REGNAME(c), REGNAME(a), REGNAME(b));
}
Listing 6.11: CoSy mapping rule
The mirPlus operator in both rules is an addition operation on the C level as defined in the CoSy
IR. There are two major differences between basic rules and CoSy mapping rules. First,
the operands in the tree patterns of the basic rules (a, b, and c) are placeholders instead of the
nonterminal reg_nt used in the CoSy rules. A condition can be specified for each basic rule in the
same way, using these placeholders. However, for the sake of simplicity, the following examples do
not include such conditions. The code selector generator keeps a so-called basic library containing
the basic rules needed for a complete coverage of C operations. For each basic rule, a list of tree
patterns is generated by replacing the placeholders with the generated nonterminals in all possible
combinations. Figure 6.4 illustrates this.
Basic Rule Library:
    mirDiff(a, b) -> c;   _SUB(a, b) -> c;
    mirPlus(a, b) -> c;   _ADD(a, b) -> c;
Nonterminals:
    reg_nt, imm_nt
Generated tree patterns:
    mirPlus(a:reg_nt, b:reg_nt) -> c:reg_nt;   _ADD(reg_nt, reg_nt) -> reg_nt;
    mirPlus(a:reg_nt, b:imm_nt) -> c:reg_nt;   _ADD(reg_nt, imm_nt) -> reg_nt;
    …
Figure 6.4: Tree pattern generation
Unfortunately, this can result in a large number of mapping rules which must be processed. Some
of them may never be generated by the frontend or just do not make sense. For instance, a rule
covering a type conversion from 16 bit to 32 bit only makes sense if the destination nonterminal
is at least 32 bits wide. Therefore, each basic rule can be annotated with certain constraints on
the nonterminals which can substitute the placeholders. Examples are restrictions on the kind
of nonterminal (register, immediate etc.) or on the respective bit-widths and bit-width relations of
the nonterminal operands. A comprehensive description of the complete library specification is
provided in Appendix B.
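The expansion of a basic rule into tree patterns, including the filtering by constraints, can be sketched in Python as follows. The dictionary encoding of the basic rule and the "allowed" constraint field are assumptions made for this illustration; the actual library format is the one specified in Appendix B.

```python
from itertools import product

# Generated nonterminals and their kind (illustrative values).
NONTERMINALS = {"reg_nt": "register", "imm_nt": "immediate"}

# Hypothetical encoding of a basic rule with per-placeholder constraints
# on the kind of nonterminal that may substitute the placeholder.
BASIC_RULE = {
    "ir": "mirPlus",
    "semantics": "_ADD",
    "placeholders": ("a", "b", "c"),
    "allowed": {"a": {"register"}, "b": {"register", "immediate"}, "c": {"register"}},
}

def expand(rule, nonterminals):
    """Return (IR tree pattern, semantics pattern) for every admissible combination."""
    names = rule["placeholders"]
    patterns = []
    for combo in product(nonterminals, repeat=len(names)):
        # Skip combinations that violate a constraint of the basic rule.
        if not all(nonterminals[nt] in rule["allowed"][p] for p, nt in zip(names, combo)):
            continue
        *ins, out = combo
        ir_args = ", ".join(f"{p}:{nt}" for p, nt in zip(names, ins))
        ir_pattern = f"{rule['ir']}({ir_args}) -> {names[-1]}:{out};"
        sem_pattern = f"{rule['semantics']}({', '.join(ins)}) -> {out};"
        patterns.append((ir_pattern, sem_pattern))
    return patterns

patterns = expand(BASIC_RULE, NONTERMINALS)
for ir_pattern, sem_pattern in patterns:
    print(ir_pattern, sem_pattern)
```

Without the constraints, two nonterminals and three placeholders give eight combinations; with them, only the two patterns of Figure 6.4 remain.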
The second difference between basic rules and CoSy rules is that the latter are associated with an
assembly instruction, i.e. the print function prints the corresponding assembly instruction, while a
basic rule has one or more semantic statements assigned. Thus, the next task is to find suitable
instructions in the LISA model which match the semantic statements of the generated tree patterns.
In most cases the generated tree patterns have only a single semantic statement that can be
directly covered by a single instruction. This is denoted as one-to-one mapping. However, since
ASIP designs should always be as efficient as possible, rarely used instructions might be removed
from the design. Unfortunately, some of them might be needed for a complete code selector
description. In this case one-to-many mapping is employed, which implements a semantic statement
with a sequence of instructions. Moreover, ASIP designers not only simplify the instruction-set
architecture but also add dedicated instructions for program hot spots. These instructions accelerate
the program execution by performing many C-level operations at once, like a MAC instruction
for instance. To utilize them in a compiler, many-to-one mapping rules are needed. For those
instructions containing an intrinsic micro-operator, a corresponding compiler-known function is
generated. The following sections describe how the instructions are selected for these four kinds
of mapping rules using the instruction semantics information in the LISA model.
One-to-one mapping
This mapping method is the first one applied by the code selector generator. The semantics
statements of the basic rules are compared with the available instruction semantics in the LISA
model. Both semantics match if the micro-operators, the operands and the bit-width specification
are the same. Figure 6.5 exemplifies this.
Generated tree pattern semantics:
    _ADD(reg_nt, reg_nt) -> reg_nt;
List of available instruction semantics:
    _ADD(_REGI(R[src1]), _REGI(R[src2])) -> _REGI(R[dst]);   Syntax: add dst, src1, src2
    _ADD(_REGI(R[src1]), _IMMI(value)) -> _REGI(R[dst]);     Syntax: add dst, src1, value
    _SUB(_REGI(R[src1]), _REGI(R[src2])) -> _REGI(R[dst]);   Syntax: sub dst, src1, src2
Generated tree pattern:
    mirPlus(a:reg_nt, b:reg_nt) -> c:reg_nt;
Assigned instruction:
    Print("add %s = %s, %s", REGNAME(c), REGNAME(a), REGNAME(b))
Figure 6.5: Matching rule semantics and instruction semantics
Since some side effects in a real instruction might not be important for code selection, a successful
one-to-one mapping does not require two identical semantics patterns. For example, assume the
selected instruction semantics in Figure 6.5 would change the carry flag (i.e. _ADD|_C|). Since
writing the carry flag does not influence the result of an arithmetic addition, this side effect
in the instruction semantics can be ignored by the generator. Thus, the instruction can still be
selected for the generated tree pattern. Of course, this relaxation of the one-to-one mapping
only applies to effects that do not affect the result of the calculation. The micro-operators,
operands and bit widths must still be exactly the same for both compared semantic statements.
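The comparison with tolerated flag side effects can be sketched in Python as follows. This is a simplified, string-based illustration: the helper names are hypothetical, the instruction list mirrors Figure 6.5 with an assumed carry-flag side effect on the register addition, and the bit-width comparison required by the text is left out for brevity.

```python
import re

def abstract_operands(semantics):
    """Replace concrete resource accesses by the nonterminal they stand for."""
    semantics = re.sub(r"_REGI\(\w+\[\w+\]\)", "reg_nt", semantics)
    return re.sub(r"_IMMI\(\w+\)", "imm_nt", semantics)

def normalize(semantics):
    """Drop flag side effects such as |_C| and all whitespace."""
    return re.sub(r"\s+", "", re.sub(r"\|_[CZON]\|", "", semantics))

def find_one_to_one(pattern_semantics, instructions):
    """Return the syntax of the first instruction whose semantics match, else None."""
    wanted = normalize(pattern_semantics)
    for syntax, semantics in instructions:
        if normalize(abstract_operands(semantics)) == wanted:
            return syntax
    return None

# Instruction semantics as in Figure 6.5; the |_C| side effect is assumed here.
INSTRUCTIONS = [
    ("sub dst, src1, src2", "_SUB(_REGI(R[src1]), _REGI(R[src2])) -> _REGI(R[dst]);"),
    ("add dst, src1, value", "_ADD(_REGI(R[src1]), _IMMI(value)) -> _REGI(R[dst]);"),
    ("add dst, src1, src2", "_ADD|_C|(_REGI(R[src1]), _REGI(R[src2])) -> _REGI(R[dst]);"),
]

print(find_one_to_one("_ADD(reg_nt, reg_nt) -> reg_nt;", INSTRUCTIONS))
```

The register addition is still selected although it writes the carry flag, while the immediate variant is rejected because its operand kinds differ.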
One-to-many mapping
As mentioned above, not all semantic statements of the generated tree patterns can be covered
by a single instruction. However, for many semantic statements alternative implementations using
a sequence of semantic statements exist. In order to implement such a one-to-many mapping,
the code selector generator needs to know the alternatives for a given semantic statement. These
are specified by so-called semantics transformations. An example is given in Figure 6.6.
Unmapped rule:
    mirNeg(a:reg_nt, b:reg_nt) -> c:reg_nt;   _NEG(reg_nt, reg_nt) -> reg_nt;
Transformation:
    ORIGINAL _NEG(a) -> b;
    TRANSFORM {
        _NOT(a) -> b;
        _ADD(b,1) -> b;
    }
Transformed rule:
    mirNeg(a:reg_nt, b:reg_nt) -> c:reg_nt;
        _NOT(reg_nt, reg_nt) -> reg_nt;
        _ADD(reg_nt, 1) -> reg_nt;
Figure 6.6: Example for a semantic transformation
The _NEG micro-operator represents a two's complement negation. The specified transformation
provides a mathematically equivalent way to perform the negation. _NOT is the one's complement
micro-operator. A two's complement can also be calculated by taking the one's complement
and adding one afterwards. Thus, if the generator fails to find an instruction for a tree pattern
covering a negation, it will then try to find a suitable instruction for each semantic statement in
the alternative implementation using the one-to-one mapping described above.
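The fallback from one-to-one to one-to-many mapping, together with the arithmetic identity behind the transformation of Figure 6.6, can be sketched in Python as follows. The tables AVAILABLE and TRANSFORMS and the instruction names are hypothetical, and a 32-bit register width is assumed.

```python
MASK = 0xFFFFFFFF  # 32-bit registers assumed

# Hypothetical model: instructions exist for _NOT and _ADD, but not for _NEG.
AVAILABLE = {"_NOT": "not", "_ADD": "add"}

# Semantics transformation of Figure 6.6: _NEG(a) -> b  becomes
# _NOT(a) -> b; _ADD(b, 1) -> b;
TRANSFORMS = {"_NEG": ["_NOT", "_ADD"]}

def map_statement(op):
    """Return the instruction sequence implementing op, or None if it stays unmapped."""
    if op in AVAILABLE:                      # one-to-one mapping
        return [AVAILABLE[op]]
    sequence = []
    for alternative in TRANSFORMS.get(op, []):   # one-to-many mapping
        if alternative not in AVAILABLE:
            return None
        sequence.append(AVAILABLE[alternative])
    return sequence or None

def neg_via_not_add(a):
    """Two's complement negation computed as one's complement plus one."""
    b = ~a & MASK
    return (b + 1) & MASK

print(map_statement("_NEG"))
print(hex(neg_via_not_add(5)))
```

The second function only checks the identity the transformation relies on: modulo 2^32, the result equals the two's complement encoding of -a.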
In principle this approach can be used to provide alternatives for nearly all semantic statements,
provided that an equivalent transformation exists which can be expressed in the form of semantics
statements. However, because of the variety of instructions implemented in different
architectures, it is not possible to specify transformations that fit every possible ISA. Nevertheless,
the basic library comes by default with a set of commonly used transformations, e.g. shift
and/or mask operations as an alternative implementation of sign or zero extension. As will be
explained later, the basic library can also be extended with user-defined transformations tailored
to the current ASIP design.
Many-to-one mapping
Many-to-one mapping is especially important for application specific instructions that perform
composite operations to accelerate the program execution. However, since the designers can
implement arbitrary combinations of operations in one instruction, it is obviously difficult to
provide basic rules without knowing what the instructions actually do. Therefore, the code selector
generation is inverted, i.e. instruction semantics in the LISA model which remain unused after
the previous steps create a tree pattern on their own. For example, consider the MAC instruction
in Listing 6.4, which is a commonly used composite operation. Two micro-operators are used,
_ADD and _MULUU, an unsigned integer multiplication. The generator knows the mapping between
the semantics micro-operators and the CoSy tree pattern nodes. Using this knowledge, it can
create a corresponding tree pattern from the instruction semantics without user interaction. In
the example, mirPlus is the CoSy tree pattern node corresponding to the micro-operator _ADD,
and mirMult maps to the _MULUU operator. If the source code contains a concatenated multiply
and addition operation, this many-to-one mapping rule can then be employed by the code selector
to use the MAC instruction instead of separate multiply and addition instructions.
Instruction semantics for a MAC:
    _ADD(_REGI(R[dst]), _MULUU(_REGI(R[src1]), _REGI(R[src2]))) -> _REGI(R[dst]);
    Syntax: mac dst, src1, src2
Generated tree pattern:
    mirPlus(c:reg_nt, mirMult(a:reg_nt, b:reg_nt)) -> c:reg_nt;
    Print("mac %s, %s, %s", REGNAME(c), REGNAME(a), REGNAME(b))
Figure 6.7: Many to one mapping for a MAC instruction
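The inverted direction, in which a tree pattern is derived from the instruction semantics, can be sketched in Python as follows. The nested-tuple representation of the semantics and the placeholder names are assumptions of this illustration; only the operator correspondence (_ADD to mirPlus, _MULUU to mirMult) is taken from the text.

```python
# Mapping between semantics micro-operators and CoSy tree pattern nodes.
IR_NODE = {"_ADD": "mirPlus", "_MULUU": "mirMult"}

def to_tree_pattern(sem, names):
    """Translate nested (operator, operand, ...) tuples into a tree pattern string.

    Leaves are register fields of the instruction; `names` maps each field to a
    placeholder, so a register used as source and destination gets one name.
    """
    if isinstance(sem, str):
        return f"{names[sem]}:reg_nt"
    op, *operands = sem
    return f"{IR_NODE[op]}({', '.join(to_tree_pattern(o, names) for o in operands)})"

# MAC semantics: dst + src1 * src2 -> dst (hypothetical tuple encoding).
mac_semantics = ("_ADD", "dst", ("_MULUU", "src1", "src2"))
names = {"dst": "c", "src1": "a", "src2": "b"}

rule = f"{to_tree_pattern(mac_semantics, names)} -> {names['dst']}:reg_nt;"
print(rule)
```

Because the accumulator appears both as an operand and as the result, it receives the same placeholder c on both sides of the generated pattern.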
Intrinsics
Generally, the many-to-one mapping works fine for arithmetic instructions whose semantics can be
described with a chain of micro-operations. As mentioned in Section 3.3.2, tree pattern matching
fails if instructions exceed the scope of a single DFT, as is the case for SIMD (Single Instruction
Multiple Data) instructions. Other instructions are just too complex and can only be described
using the intrinsic micro-operator as introduced in Section 6.2.2. Many compilers, though, provide
support for this kind of instruction via Compiler Known Functions (CKFs) or intrinsics.
Basically, CKFs make assembly instructions accessible within the high level source code, where
the compiler expands a CKF call like a macro. In order to integrate support for those instructions
as well, the code selector generator creates for each instruction with an intrinsic micro-operator
a CKF function definition for the compiler's internal function prototype list and a mapping rule
matching this particular CKF. As depicted in Figure 6.8, this is basically a one-to-one translation.
Intrinsic micro-operator:
    "_DCT2d"(_REGI(R[src])) -> _REGI(R[dst]);
Compiler's internal CKF prototype definition:
    int DCT2d(int);
CoSy mapping rule matching this CKF:
    mirFuncCall(a:reg_nt) -> b:reg_nt
    CONDITION { FuncCallType == "DCT2d" }
Figure 6.8: CKF generation
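The translation of Figure 6.8 can be sketched in Python as follows. The function generate_ckf is hypothetical, and treating every argument and the result as int is an assumption carried over from the single example in the figure.

```python
def generate_ckf(intrinsic_name, n_args=1):
    """Return (CKF prototype, mapping rule text) for one intrinsic micro-operator."""
    func = intrinsic_name.lstrip("_")             # "_DCT2d" -> "DCT2d"
    params = ", ".join(["int"] * n_args)          # assumption: all arguments are int
    operands = ", ".join(f"a{i}:reg_nt" for i in range(n_args))
    prototype = f"int {func}({params});"
    rule = (f"mirFuncCall({operands}) -> b:reg_nt\n"
            f'CONDITION {{ FuncCallType == "{func}" }}')
    return prototype, rule

prototype, rule = generate_ckf("_DCT2d")
print(prototype)
print(rule)
```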
6.4 Compiler Designer Integration
The mapping rule generator seamlessly complements the Compiler Designer (Figure 6.9). Basi-
cally, the nonterminals are already generated when the tool starts up. Afterwards, the mapping
rule generation can be started with a push button and the generated rules are displayed. However,
as mentioned above, certain mapping rules may still remain unmapped after the rule generation
since the ASIP design probably does not feature all required instructions. It might also happen
that the designer wants to create additional mapping rules in order to improve the code selector
description. In either case, the designer can use the mechanism described in Section 5.2 to assign
an instruction manually or to create new mapping rules.
In the early architecture exploration phase, when the design changes quite often and consequently
many compiler configurations are generated, this manual step must be done over and over again.
In order to avoid this repetition, the user can specify a so called target specific library, basically an
extension of the basic library, which contains additional mapping rules or target specific semantic
transformations to automate this process.
6.5 Synopsis
• Due to the semantic gap it is not possible to extract instruction semantics as required for
code selector generation from detailed instruction behavior descriptions.
[Diagram: the LISA 2.0 description (regs, data mem, prog mem, pipeline control, prog seq, IF/ID,
ID/EX/WB) with its SEMANTICS sections is the input of the Processor Designer, which generates
assembler, linker and ISS, and of the Compiler Designer. There, nonterminal generation and
mapping rule generation (one-to-one, one-to-many, many-to-one, intrinsics), supported by the
basic library and the target specific library, produce the .CGD description from which CoSy builds
the C compiler (code selector, scheduler, code emitter). The .c applications are compiled and
simulated, and the loop is repeated until the design goals are met.]
Figure 6.9: Design flow with automatic code selector generation
• The instruction semantics are captured by extending the ADL. A formalism for the descrip-
tion of instruction semantics is presented.
• The code selector generation consists of two phases, namely nonterminal generation and
mapping rule generation. The latter utilizes four different methods to generate the code
selector description fully automatically.
• The presented approach is integrated into the Compiler Designer. This complements the
Processor Designer framework such that the automatic generation of all software develop-
ment tools from an abstract processor model is achieved.
Chapter 7
Results for SEMANTICS based Compiler Generation
This chapter gives a detailed account of the feasibility of the semantics based approach for C compiler
generation and the quality of the generated compilers.
7.1 Case Studies
In order to investigate the feasibility of modeling instruction semantics with the methodology
described in the previous chapter, several existing LISA models have been enhanced with
SEMANTICS sections for compiler generation. This includes both Instruction Accurate (IA) and
Cycle Accurate (CA) LISA models. More specifically, the following cores have been used: the
ARM7, the CoWare LTRISC processor, the STMicroelectronics ST220 VLIW (4 issue slots) multimedia
processor [75], the Infineon PP32 network processing unit, the Texas Instruments C54x
digital signal processor [236], the MIPS4K [161] and the NXP Semiconductors TriMedia VLIW
(5 issue slots) multimedia processor [169]. The LTRISC processor is a fully functional RISC template
included in CoWare's Processor Designer. The PP32 is an evolution of [253] and comprises
bit-field instructions. Although the SEMANTICS section is not intended for the extension of already
existing models, this approach proved that the new section does not impose any particular
modeling style, which is crucial w.r.t. LISA's flexibility paradigm. All models have been enhanced
without any changes to the already existing specification.
                             ARM7   LTRISC   ST220   PP32   C54x   MIPS   TriMedia
Abstraction level            IA     IA       CA      CA     CA     IA     CA
ISA                          RISC   RISC     RISC    RISC   CISC   RISC   RISC
No. of operations            108    39       121     151    408    153    265
Design effort ∆ (man-days)   4d     2d       10d     8d     15d    5d     12d
Table 7.1: SEMANTICS section statistics
Table 7.1 summarizes the results. Note that the design effort for adding semantics to the existing
models is given in man-days. As expected, the work for adding SEMANTICS sections scales with
the number of operations in the architecture. In the case of the TriMedia this is not entirely true
since many instructions are actually duplicated with marginal changes. This is due to TriMedia's
capability of executing certain instructions conditionally, i.e. each case (conditionally / not
conditionally executed) is modeled with its own operation. The complexity of the instruction-set (RISC
vs. CISC) influences the effort, too. Generally, the effort for describing instruction semantics is
much lower than for a behavioral description in C. For instance, a 19x19 multiplication can be
easily described with a single micro-operation and corresponding bit-field specifications, whereas
a behavioral description usually requires a significant amount of C code which additionally has
to be validated. In particular for the PP32, the explicit bit-field specification for the semantics
(compared to a typical description in C using and/or/shift operations) reduces the design time
significantly.
7.2 Mapping Rule Generation
Among the LISA models with SEMANTICS sections the ST220, the PP32 and the MIPS have
been selected to evaluate the mapping rule generator. The resulting code quality is compared to
a CoSy compiler with hand-crafted mapping rules as well as a non CoSy based compiler. Both
CoSy compilers are generated using the Compiler Designer tool. The ISA characteristics relevant
for mapping rule generation are as follows:
ST220 The ST220 VLIW core is part of the STMicroelectronics ST200 scalable and customizable
core family, designed to be embedded into multimedia SoC devices. It can execute up to four
instructions per clock cycle and features a multiplication unit. The load/store architecture
incorporates two register files: one consists of 64 registers which are 32 bits wide, and the other
contains eight one-bit-wide branch registers. Each branch register can be used for condition
testing and conditional branches. Register-offset addressing is the only supported addressing
mode.
PP32 The Protocol Processor (PP) has a RISC-based ISA with a single issue slot, implemented
in a four stage pipeline. It is a typical Harvard architecture with separate program memory
access. Among others, register-offset addressing is supported for load/store operations. The
PP features extensions for bit-field operations which are optimized for single cycle processing
of arbitrary bit patterns without additional shift/mask operations. The global register file
consists of 16 elements, each having a data-word width of 32 bits. Conditional branches
are executed depending on the status of the carry/zero flag, while comparisons are mostly
performed by separate instructions.
MIPS The MIPS is a 32 bit RISC core implementing the well known MIPS32 ISA [161]. It
features 32 general purpose registers which are 32 bits wide and two special purpose registers
for the multiply-divide unit. Again, the register-offset addressing mode is the only supported
one. Conditional branches can perform the comparison themselves or just depend on the zero
flag status. However, single instructions for (some) comparison operations are supported as
well.
For all architectures the typical set of nonterminals (i.e. register, immediate, addressing mode)
is automatically generated. During the initial run most of the resulting mapping rules for all
processors get a suitable instruction automatically assigned. To handle the unassigned rules as
well, all processors required a few custom transformations and/or mapping rules in the target
library. The CPU time used by the generator is negligible. Table 7.2 provides the statistics of
the generated nonterminals (NT) and rules for all processors as well as the number of required
custom transformations.
        NT   one-to-one   one-to-many   many-to-one   custom rules   custom trans.
ST220   9    176          13            5             4              4
PP32    9    71           19            0             5              6
MIPS    5    49           61            0             4              5
Table 7.2: Rule statistics for ST220, PP32, and MIPS
The custom rules and transformations are mainly used for those rules which cannot be implemented
with a single machine instruction, such as signed/unsigned division, the modulo operation (PP32,
ST220) and multiplication (PP32). The custom entries in the target library map those to function
calls to the runtime library, which provides a software implementation of such
operations. For the ST220, an additional transformation is used to perform the one's complement
operation with an instruction performing a bitwise NOT and OR at once. The PP32 also needs
some very specific transformations. For example, the load of a 32 bit immediate value has to
be performed with two instructions. The first one loads the upper half of the value into the
destination register and left shifts the result by 16 bits at the same time. The second one adds
the remaining lower 16 bits to the target register. A similar transformation is required for the
MIPS. Additionally, the latter needs custom transformations for some compare conditions since
they are not available in the MIPS ISA and must be performed in a different way. However, the
specification of custom transformations in the target library is a one-time effort. Afterwards, the
complete code selector specification can be generated fully automatically.
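The two-instruction load of a 32 bit immediate described above can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that the second instruction adds the lower 16 bits zero-extended; with a sign-extending add, the upper half would have to be adjusted accordingly.

```python
MASK32 = 0xFFFFFFFF

def split_immediate(value):
    """Split a 32-bit constant into the 16-bit fields of the two instructions."""
    value &= MASK32
    return value >> 16, value & 0xFFFF

def load_immediate(value):
    """Emulate the two-instruction sequence on a 32-bit register."""
    high, low = split_immediate(value)
    reg = (high << 16) & MASK32     # first instruction: load upper half, shifted left by 16
    reg = (reg + low) & MASK32      # second instruction: add the lower 16 bits
    return reg

print(hex(load_immediate(0x12345678)))
```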
7.3 Compiler Evaluation
The following sections evaluate the code quality for the different target architectures. The CoSy
compiler with hand-crafted code selector specification is used as the baseline for the evaluation. The
CoSy compiler with generated code selector specification and a non-CoSy-based compiler are compared
to it. In the case of the ST220, the latter is the highly optimizing vendor compiler named ST Multiflow,
and for the MIPS it is a gcc [78] based compiler. For the PP32, however, there is no vendor compiler
available. Instead, the lcc compiler [42] has been manually retargeted to the PP32 as an additional
reference point. All CoSy based compilers have been verified using the Supertest compiler validation test
suite from ACE [29]. It took several man-weeks to validate the compilers with hand-crafted code
selector specification, in contrast to the compilers with generated code selector specification, which
passed the test out of the box.
It can be expected that the compilers with generated code selector specification show a certain
overhead in code quality. This is mainly due to the fact that the basic rules are designed to fit
many different architectures and consequently might not be optimal for certain target processors.
Additionally, the hand-crafted code selector can exploit certain architecture properties, e.g. the
integral promotion for some of the C arithmetic operators can be omitted under the assumption
that the values in the registers are always correctly sign or zero extended. The generated rules,
in contrast, must always guarantee the correct behavior and might be too conservative in such cases. Of
course, the user can always enrich the target specific library to improve the generated code selector
description. However, except for the custom transformations required to enable the generation of
the complete code selector description, optimized target rules are not specified for this evaluation.
The concrete overhead for each architecture will be quantified in the following.
7.3.1 PP32
Figures 7.1 and 7.2 show the relative cycle count and code size for seven benchmarks extracted
from NPU applications, with the CoSy compiler using the hand-crafted code selector set to 100%.
For most benchmarks, the code quality of the compiler generated from the semantic description
is close to the hand-crafted version. However, in some cases a large code quality overhead can be
observed. This is mainly caused by the multiplication rules. As mentioned above, some custom
transformations map the multiplication to a software implementation in the runtime library. This
generic approach makes the transformation feasible for many target architectures. The hand-crafted
compiler, in contrast, employs an optimized assembly program for this purpose which is
significantly faster. However, the user could also create a custom transformation that yields exactly
the same assembly routine for the multiplication (Listing 7.1). But this optimization is
usually done when the architecture exploration phase converges and an initial working compiler
is available.
ORIGINAL _MULII(REGISTER a, REGISTER b) -> REGISTER c;
SCRATCH t1,t2;
TRANSFORM {
    0 -> t1;
    0 -> t2;
    b -> c;
  LLabel_0:
    IF (_EQ(b<0,1>,0)) {
        _ADD(_PC,LLabel_1<0,13>) -> _PC;
    }
    _ADD(t1,a) -> t1;
  LLabel_1:
    _LSR(b, 1) -> b;
    t1<0,1> -> b<31,1>;
    _LSR(t1, 1) -> t1;
    t1<30,1> -> t1<31,1>;
    IF (_EQ(_SUB(t2, 1),0)) {
        _ADD(_PC,LLabel_0<0,13>) -> _PC;
    }
    _SUB(t2,1) -> t2;
}
Listing 7.1: PP32 specific transformation for multiplication
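For orientation, the generic shift-and-add scheme on which software multiplication routines of this kind are typically based can be sketched in Python as follows. This is not a line-by-line translation of Listing 7.1; the function name, the 32-bit width, and the truncation of the product to the register width are assumptions of the sketch.

```python
def shift_add_multiply(a, b, width=32):
    """Multiply two unsigned values by conditional additions and shifts."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for _ in range(width):
        if b & 1:                         # lowest multiplier bit set: add multiplicand
            product = (product + a) & mask
        a = (a << 1) & mask               # the next multiplier bit has twice the weight
        b >>= 1                           # move on to the next multiplier bit
    return product

print(shift_add_multiply(6, 7))
```

Each iteration corresponds to one conditional add and a pair of shifts, which indicates why a routine of this kind costs many cycles per multiplication on a core without a multiplier.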
Thanks to a richer set of built-in code optimization techniques, the CoSy based compilers always
outperform the lcc w.r.t. the cycle count. Since the lcc’s code selector basically corresponds to
the hand-crafted CoSy compiler, the code size of both compilers is almost the same.
[Bar chart. Y axis: relative cycle count in %, 0% to 300%. Benchmarks: frag, tos, hwacc, route,
reed, md5, crc. Series: CoSy hand-crafted, CoSy generated, lcc.]
Figure 7.1: Relative cycle count PP32
[Bar chart. Y axis: relative code size in %, 0% to 400%. Benchmarks: frag, tos, hwacc, route,
reed, md5, crc. Series: CoSy hand-crafted, CoSy generated, lcc.]
Figure 7.2: Relative code size PP32
7.3.2 ST220
The picture is different for the ST220. Figures 7.3 and 7.4 illustrate the results for several kernels
taken from the DSPstone benchmark suite [110] and a prime number computation based on the
Sieve of Eratosthenes. The code quality of the compiler generated from the semantic description
shows on average an overhead of 5% in cycle count and 18% in code size as compared to the
hand-crafted version. The overhead is lower than for the PP32, firstly because there is no issue
with the multiplication implementation (the ST220 supports multiplication). Secondly, only a few
of the one-to-many mapping rules (cf. Table 7.2) have a one-to-one mapping in the hand-crafted
version.
Compared to the ST Multiflow compiler, the CoSy based compilers show an average overhead of
75% in cycle count and 99% in code size, partially due to extensive function inlining. These are
acceptable values, taking into account that the development time for the ST Multiflow compiler
was probably orders of magnitude higher and that the CoSy based compilers are essentially
"out-of-the-box" generated compilers without machine-specific optimizations. Analysis of the generated
code showed that significantly higher code quality could be achieved by adding custom optimization
engines, e.g. for exploiting predicated execution.
7.3.3 MIPS
The results for the MIPS, depicted in Figures 7.5 and 7.6, show a similar picture as for the PP32.
Apart from the benchmarks used for the ST220, larger kernels from different benchmark suites
[49, 135] or applications [173, 242] have been chosen. The compiler generated from the semantic
descriptions shows an average overhead of 88% in cycle count and 45% in code size. In contrast
[Figure: bar chart of the relative cycle count in % (axis 0% to 140%) for the benchmarks fir, dct, adpcm, fht, viterbi, gsm and sieve; series: CoSy hand-crafted, CoSy generated, ST Multiflow]
Figure 7.3: Relative cycle count ST220
[Figure: bar chart of the relative code size in % (axis 0% to 140%) for the benchmarks fir, dct, adpcm, fht, viterbi, gsm and sieve; series: CoSy hand-crafted, CoSy generated, ST Multiflow]
Figure 7.4: Relative code size ST220
to the previous hand-crafted CoSy compilers, a considerable amount of work has been spent on
the code selector specification for the MIPS, because in another context it was evaluated how close a
CoSy compiler generated by the Compiler Designer can come to a production-quality compiler.
Consequently, a larger overhead for the semantics-based compiler can be observed. The hand-crafted
compiler shows an overhead of only 5% in cycle count as compared to the gcc. Code size
numbers for the gcc are omitted since it uses a different runtime setup (i.e. functionality that is
linked to the executable to set up the runtime environment), which leads to a significantly different
code size.
[Figure: bar chart of the relative cycle count in % (axis 0% to 300%) for the benchmarks sieve, adpcm, miniLzo, blowfish, libmad, cjpeg, djpeg and jpegtrans; series: CoSy hand-crafted, CoSy generated, GCC]
Figure 7.5: Relative cycle count MIPS
[Figure: bar chart of the relative code size in % (axis 0% to 200%) for the benchmarks sieve, adpcm, miniLzo, blowfish, libmad, cjpeg, djpeg and jpegtrans; series: CoSy hand-crafted, CoSy generated]
Figure 7.6: Relative code size MIPS
7.4 Conclusions
Designing an ADL that in particular serves the purpose of C compiler and simulator generation
from a single model is quite challenging (cf. Chapter 4). Typically, this leads either to a loss in
modeling flexibility or introduces a huge potential for inconsistencies. This thesis presented an
approach for the LISA ADL that avoids both. It incorporates a new SEMANTICS section into
the LISA language definition which achieves a concise formalism for the description of instruction
semantics without influencing the existing flexibility. This information is used by four different
mapping rule generation methods which create the code selector description for a C compiler fully
automatically. In this way, even non-compiler experts are capable of generating C compilers for
early architecture exploration. Manually created code selector descriptions are a typical source
of errors, but the generated code selector rules are correct by construction. Hence, a significant
verification and debug effort is saved.
Although using a semantics description introduces certain redundancies, they are kept minimal
in the model. Note that apart from code selector generation, it is also possible to generate an
instruction-set simulator and documentation with the information provided by the SEMANTICS
sections [81]. Since the semantics description is much simpler than the C/C++ description,
this helps to accelerate the modeling process in early architecture exploration when the concrete
micro-architecture is not fully determined. However, a detailed discussion of the simulator
generation is beyond the scope of this thesis.
From the above case studies it should be obvious that the flexibility of the new SEMANTICS section
w.r.t. feasible target architecture classes is not a major concern in this approach. Furthermore, C
compilers can now be generated fully automatically from LISA models with SEMANTICS sections.
Such an integrated approach, based on only a single “golden” target processor model, is key for an
effective ASIP design environment. The resulting lower code quality of the generated compilers
is acceptable considering that the C compiler is available right from the beginning.
Compared to compiler generation with a pure stand-alone system like CoSy or with the Compiler
Designer without code selector generation, the compiler description effort is reduced to a minimum.
Moreover, the presented approach hides even more compiler technology internals from the ASIP
design engineer, who thus can better concentrate on architecture optimization. Another advantage
is that the code selector rules are correct by construction. This eliminates a prominent source of
errors in compiler descriptions.
The code quality of the generated compilers can only be considered as the result of “out-of-the-box”
compilers. Analysis of the generated code showed that by adding custom optimization engines,
e.g. for exploiting predicated execution, significantly higher code quality could be easily achieved,
though at the expense of higher manual effort. Furthermore, while the integration of high-level
optimizations into retargetable compilers is mostly supported, this is not the case for low-level or
assembly-level optimizations. Most generated assemblers do not offer the opportunity to plug in
user-defined optimizations. Therefore, the remainder of this thesis focuses on two topics:
• Retargetable optimization techniques for common ASIP extensions to further narrow the
code quality gap while reducing compiler design effort.
• A new retargetable assembler that provides an implementation interface to quickly develop
user-defined optimization techniques.
Chapter 8
SIMD Optimization
As concluded in the previous chapter, retargetable compilers, as used in ASIP design environments,
are still hampered by their limited code quality as compared to hand-written compilers or assembly
code. Consequently, generated compilers must be manually refined into highly optimizing
compilers after successful architecture exploration. One way of overcoming this dilemma is to design
retargetable optimizations for those architectural features which characterize a class of target
processors.
[Figure: 32-bit registers A and B (bits 0 to 31), split into sub-registers A1, A2 and B1, B2, are filled from 32-bit, word aligned memory by Load / Pack; two parallel "+" operations produce register C with sub-register C1=A1+B1 and sub-register C2=A2+B2, which is written back to memory by Store / Unpack]
Figure 8.1: Sample arithmetic SIMD instruction: two parallel ADDs on 16-bit sub-registers of
32-bit data registers A, B, C; the data is loaded/stored at once from/to an alignment boundary
This chapter focuses on target processors equipped with SIMD instructions. As illustrated in
Figure 8.1, a SIMD instruction performs several primitive operations in parallel, using operands
from several sub-registers of the processor’s data registers at a time. The operands are typically 8,
16 or even 32 bits wide. In the future, the SIMD data paths might grow even larger with the advances
in semiconductor technology. Other typical SIMD instructions perform more complex operations
(e.g. partial dot products) or serve for sub-register packing and permutation. From a hardware
perspective, SIMD instructions are easy to control and have a simple structure (the existing data
path is basically just split) without extra register file ports. This makes them inherently simple
and thus keeps the hardware cost low. Meanwhile, they can provide significant performance
improvements for computation-intensive multimedia workloads [128]. Therefore, many embedded
processors for the next generation of high-end video and multimedia devices today feature SIMD
instructions.
The SIMD concept debuted in general purpose architectures such as Intel MMX/SSE1–5,
IBM/Motorola VMX/AltiVec and AMD 3DNow!. Later on it was introduced in domain-specific
processors (e.g. TI C6x, NXP TriMedia) and in recent custom ASIP designs (e.g. Tensilica
Xtensa). Even some versions of the popular ARM and MIPS based architectures feature SIMD
instructions. While several target-specific C compilers already exploit SIMD instructions, there is
almost no support in ASIP compilers. Consequently, there is an increasing interest in retargetable
compilers with SIMD support. For use in this domain, retargetable SIMD optimizations are
required. This chapter presents a novel concept for retargetable code optimization for ASIPs with
SIMD instructions. The concept is demonstrated by an implementation within the CoSy compiler,
which can be retargeted via the Compiler Designer GUI, and by an experimental evaluation for two
real-life embedded processors.
The rest of this chapter is organized as follows. In Section 8.1 related work is discussed. The core
of the SIMD framework is presented in Section 8.2 before the retargeting procedure is described in
Section 8.3. Afterwards, Section 8.4 provides the experiments for different embedded processors
with SIMD support. Finally, Section 8.5 summarizes the contribution of this approach and points
to some future avenues of work.
8.1 Related Work
Traditional code selection typically relies on tree parsing. As mentioned in Section 3.3.2, tree
parsing is not suited to exploit SIMD instructions because they exceed the scope of a single DFT.
Consequently, compilers require advanced techniques to exploit SIMD instructions.
Most of the current SIMD optimization techniques are based on the traditional loop-based
vectorization [189, 84, 22, 188]. Others make use of instruction packing techniques in conjunction
with loop-unrolling to exploit data parallelism within a basic block [217] or a combination of
traditional code selection [43] and integer linear programming [197, 24]. As investigated in [89],
it is often difficult to apply SIMD optimization techniques since these architectures are largely
nonuniform, featuring specialized functionalities, constrained memory accesses and a limited
set of data types. Moreover, complicated loop transformation techniques are needed [189] to
exhibit the necessary, architecture-dependent amount of parallelism in the code. Another hurdle
to applying SIMD techniques is packing of data elements into registers and the limitations
of the SIMD memory unit: Typically, SIMD memory units provide access only to contiguous
memory elements, often with additional alignment constraints. Computations, however, may
access the memory in an order which is neither adequately aligned nor contiguous. Besides,
operations on disjoint vector elements are usually not supported. The detection of misaligned
pointer references is presented in [104]. Certain misalignments can be solved either by loop
transformations [84, 218] or by data permutation instructions. The efficient representation and
generation of such instructions is investigated in [5, 188, 63] and the optimization thereof in
[24, 90]. Consequently, only a successful interaction of several optimization modules will be able
to leverage SIMD optimization for retargetable compilers.
So far, only advanced compilers (e.g. the Intel compiler [108], IBM XL compiler [5]) are capable
of automatically utilizing SIMD instructions. Apart from being inherently non-retargetable these
compilers are mostly restricted to certain C language constructs. Other compilers use dedicated
input languages for source-to-source transformations which are restricted to a certain application
domain [74, 167]. The vast majority of the compilers, though, still provide only semi-automatic
SIMD support via Compiler Known Functions (CKFs). Understandably, this assembly-like
programming style is tedious and error-prone. Moreover, it comes along with poor maintainability
and portability of the code.
Among the ASIP design platforms mentioned in Chapter 4, so far only Tensilica’s com-
piler includes SIMD support. However, its architectural scope is limited to the configurable
Xtensa processor [191]. Considering retargetable compilers, recent versions of the gcc sup-
port SIMD for certain loop constructs [77]. Unfortunately, gcc is mainly designed for general
purpose processors and as a result, does not adapt efficiently to embedded processor architectures.
In summary, several SIMD utilization concepts with different levels of complexity are available.
However, they are mostly implemented in target-specific compilers. Consequently, adapting a
SIMD optimization concept to a new target processor becomes a time-consuming and error-prone
manual process. Therefore, this thesis presents an approach for the efficient utilization of SIMD
instructions while achieving compiler retargetability at the same time. The presented SIMD op-
timization comprises a loop-vectorizer and an unroll-and-pack based technique [147], which are
both driven by the same SIMD specification. The retargeting formalism is fully integrated into
the compiler backend specification. The advantage is that many generators for the standard back-
end components (e.g. the code selector) can be reused for the SIMD optimization to a great extent.
This reduces the retargeting effort and enables greater flexibility to specify the SIMD architecture.
The amount of required target specific information is limited, so that most of it can be extracted
automatically from ADL descriptions such as LISA. Moreover, the retargeting information is also
used to steer the loop transformations, such as unrolling and strip-mining, required to exhibit the
necessary (i.e. SIMD architecture-dependent) amount of parallelism and to deal with memory
alignment issues. In sum, this provides a flexible and efficient SIMD optimization framework for
a wide variety of SIMD architectures.
8.2 SIMD Framework
As mentioned above, a successful SIMD optimization is tightly coupled with several loop trans-
formations in order to exhibit the necessary amount of parallelism and to convert loops into a
proper form. Hence, the presented approach consists of several steps as depicted in Figure 8.2.
[Figure: flow from the IR to the IR + SIMD through the stages Loop Car. Data Dep. and Alignment analysis, SIMD Analysis, Strip Min. / Loop Peel., Scalar Expansion (decision "Scalar Expansion profitable?"), Vectorizer (decision "Vectorization feasible?"), Loop Unroll and SIMDfyer; the SIMD Analysis provides the strip size, the unroll factor, the SIMD candidates and the SIMD datapath width]
Figure 8.2: SIMD code generation flow
First of all, a loop carried dependency [159] and alignment analysis (Section 8.2.3) are performed.
They provide the necessary annotation needed by the SIMD optimization framework. Afterwards,
a SIMD Analysis (Section 8.2.4) searches for loops where SIMD optimization could be applied.
For these loops it determines the parameters for the different loop transformations (Sections 8.2.5
to 8.2.8). Finally, the SIMD optimization is performed, comprising a loop vectorizer (Section 8.2.7)
or an unroll-and-pack based SIMDfyer (Section 8.2.9) if vectorization fails. All modules are driven
by the same, retargetable SIMD specification described in Section 8.3.
8.2.1 Basic Design Decisions
A basic design decision concerns the representation of generated SIMD instructions in the com-
piler’s IR. All IR formats comprise elements for representing primitive operations like addition,
subtraction, multiplication, and so on. However, there are usually no dedicated IR elements for
SIMD operations such as “two parallel additions”. Extending the underlying IR format is not a
practicable solution. All already existing compiler engines would have to be manually adapted
in order to handle the new IR elements. Otherwise compiler engines might not exploit the full
optimization potential or may even fail in the worst case. In either case poor code quality would
be the result. Therefore, generated SIMD instructions are internally represented in the form of
CKFs. CKFs are transparent for other compiler modules and are later automatically replaced
with assembly instructions in the backend. They are not visible to the compiler user at all.
Furthermore, CKFs simplify code generation to a certain extent, since they abstract from low-level
problems like register allocation for SIMD sub-registers in the backend. Moreover, all existing
code generation and optimization engines of the underlying compiler framework can simply be
reused. This includes the existing debug facilities of the compiler platform. In this way, the
current IR state can be dumped into a human-readable, valid C code file at any time during the
SIMD generation process.
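To illustrate why a CKF-based representation keeps the IR dumpable as valid C, the following minimal sketch emulates a two-way 16-bit SIMD addition as an ordinary C function. The names SIMD_add_2x16, pack_2x16 and extract_2x16 as well as the lane layout (lane 0 in the low half-word) are assumptions made for this illustration only; in the flow described above, such a call would be replaced by a single assembly instruction in the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CKF: two independent 16-bit additions on the halves of
   32-bit operands. Lane 0 is assumed to be the low half-word. */
static uint32_t SIMD_add_2x16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;                 /* lane 0, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu; /* lane 1, carry discarded */
    return (hi << 16) | lo;
}

/* Pack two 16-bit values into one 32-bit register image. */
static uint32_t pack_2x16(uint16_t lane0, uint16_t lane1)
{
    return ((uint32_t)lane1 << 16) | lane0;
}

/* Extract one lane (0 or 1) from a 32-bit register image. */
static uint16_t extract_2x16(uint32_t v, int lane)
{
    assert(lane == 0 || lane == 1);
    return (uint16_t)(v >> (16 * lane));
}
```

Because the emulation is plain C, a dumped IR that contains such calls can be compiled and tested on the host, which is the debugging benefit mentioned above.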
8.2.2 Terminology
Here, the terminology that facilitates the description of the optimization modules in the next
sections is briefly introduced. As exemplified in Figure 8.1, a SIMD instruction performs independent,
usually identical operations on certain bit ranges within the input registers and writes
the results to the corresponding ranges in the output register. In other words, a SIMD instruction
splits a full register into k sub-registers (frequently k = 2 or k = 4). In the given example the
lower and upper parts of the arguments are added and written to the lower and upper part of
the destination register respectively. Thus, this SIMD instruction operates on 2 sub-registers. A
single, primitive operation within the SIMD instruction (e.g. the 16-bit addition) is denoted as
a SIMD candidate. It is basically a mapping rule covering this primitive operation. From these
mapping rules a SIMD candidate matcher (Section 8.3.1) is generated (i.e. a regular tree pattern
matcher) that is used for the identification of such SIMD candidates.
A set of SIMD candidates that can be combined into a SIMD instruction is denoted as a SIMD-set.
For this purpose a generated SIMD-set constructor is employed (Section 8.3.2). This is basically a
combination function that tries to collect suitable SIMD candidates under given constraints such
that a valid SIMD-set can be built. The algorithm for SIMD-set construction assumes that the
results from the data flow analysis are already available. It then checks a number of constraints
for tuples $N = (n_1, \ldots, n_k)$ of SIMD candidates, where $k$ denotes the number of sub-registers.
The nodes $n_i$ of a potential SIMD-set must
1. represent isomorphic operations that can be combined to a SIMD instruction according to
the target machine description.
2. show no direct or indirect dependencies that would prevent parallelism. While this can be
analyzed relatively simply for scalar variables, it becomes quite difficult in the case of array and
pointer accesses.
3. fulfill alignment constraints of the given target architecture. The data elements in memory
must be packed in a single register in advance before the SIMD instruction can be executed.
This involves wide load instructions and hence possibly memory alignment constraints, as
well as the reordering of sub-registers within a register using special pack and permute
instructions. The same holds for storing the SIMD result back to memory.
A constructed SIMD-set (i.e. the related IR nodes) can then be replaced by a CKF call. The
regular code selector description is enriched with CKF mapping rules so that later during the
code emission phase the proper assembly code for the SIMD instruction can be emitted.
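The three constraints above can be sketched as a simple predicate over a tuple of candidates. This is only an illustration under strong assumptions, not the generated SIMD-set constructor: the Candidate record, its fields and the function is_valid_simd_set are hypothetical, dependencies are reduced to a single "depends on candidate j" index, and memory operands are assumed to have known, non-negative byte addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one SIMD candidate (illustrative fields only). */
typedef struct {
    int  opcode;      /* primitive operation, e.g. a 16-bit multiply        */
    int  width_bits;  /* operand width in bits                              */
    long address;     /* byte address of the accessed memory element        */
    int  depends_on;  /* index of a candidate it depends on, -1 if none     */
} Candidate;

/* Check the three constraints for a tuple of k candidates; boundary is the
   assumed SIMD memory boundary in bytes. */
static bool is_valid_simd_set(const Candidate *n, size_t k, long boundary)
{
    assert(n != NULL && k > 0 && boundary > 0);
    long elem = n[0].width_bits / 8;

    /* (3) the first element must start at an alignment boundary */
    if (n[0].address % boundary != 0)
        return false;

    for (size_t i = 0; i < k; i++) {
        /* (1) isomorphic operations */
        if (n[i].opcode != n[0].opcode || n[i].width_bits != n[0].width_bits)
            return false;
        /* (2) no dependency on another member of the tuple */
        if (n[i].depends_on >= 0 && (size_t)n[i].depends_on < k)
            return false;
        /* (3) contiguous memory elements */
        if (n[i].address != n[0].address + (long)i * elem)
            return false;
    }
    return true;
}
```

A real constructor would additionally have to consider indirect dependencies and the pack and permute instructions that can repair a non-contiguous layout; both are left out here.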
8.2.3 Alignment Analysis
The SIMD memory unit usually implies certain constraints on the memory access. For example, a
two-fold SIMD instruction operating on 16-bit data types typically uses a 32-bit wide, word aligned
load operation to pack them at once in a 32-bit register (Figure 8.3). If the word alignment cannot
[Figure: two 32-bit loads that pack Sub-register 1 and Sub-register 2 into a register, one with alignment = 0 and one with alignment = 2 relative to the SIMD memory boundary; the memory is drawn in 8-bit units]
Figure 8.3: SIMD alignment constraint
be assured at compile-time, additional code (i.e. a dynamic alignment check) is required to ensure
correct alignment during run-time. The strip mining transformation (Section 8.2.5) needs to take
the alignment into account, too. Therefore, an inter-procedural pointer alignment analysis [73]
has been implemented for precise alignment information. It analyses every memory access done
through pointers with respect to the capabilities of the SIMD memory unit. The offset from
the supported SIMD memory boundary, that is, the alignment, is calculated using the modulo
operator. If $p$ is a pointer and $N$ the SIMD memory address size, then the alignment of the
memory access is given by:
\[ \text{alignment} = p \bmod N \tag{8.1} \]
Since each pointer might have different values and associated alignments, the information is stored
as a set $E$ of possible values modulo $N$. If $M = \{0, \ldots, N-1\}$ is the set of all possible values
modulo $N$ and $P = \mathcal{P}(M)$ its power set, then $E \in P$. In order to evaluate pointer arithmetic
such as *(p+i), a transfer function
\[ f_g : P^n \longmapsto P \tag{8.2} \]
is used to compute the impact on $E$. The transfer function, naturally, depends on the operator of
the arithmetic expression. For example, the most common operations in address calculation,
addition and multiplication, are binary operators and thus the corresponding transfer functions
have the form $f_{Binary} : M \times M \mapsto M$. This leads to the following equations:
\[
\begin{aligned}
f_{Add}(a, b) &= (a + b) \bmod N = [(a \bmod N) + (b \bmod N)] \bmod N \\
f_{Mul}(a, b) &= (a \cdot b) \bmod N = [(a \bmod N) \cdot (b \bmod N)] \bmod N
\end{aligned}
\tag{8.3}
\]
Similar transfer functions exist for the remaining operators.
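The transfer functions of Equation 8.3 can be lifted from single residues to sets of residues, as in the following sketch. The bit-mask encoding of the set E (bit r set if and only if r is in E), the restriction to N ≤ 32 and the names AlignSet, f_add, f_mul and transfer are assumptions of this example; the thesis does not prescribe a particular data structure.

```c
#include <assert.h>
#include <stdint.h>

/* A set E of residues modulo n, stored as a bit mask (assumed encoding). */
typedef uint32_t AlignSet;

typedef unsigned (*BinaryOp)(unsigned a, unsigned b, unsigned n);

/* Transfer functions on single residues, cf. Equation 8.3. */
static unsigned f_add(unsigned a, unsigned b, unsigned n) { return (a + b) % n; }
static unsigned f_mul(unsigned a, unsigned b, unsigned n) { return (a * b) % n; }

/* Lift a binary transfer function to sets: the result contains op(a, b)
   for every residue a in ea and every residue b in eb. */
static AlignSet transfer(BinaryOp op, AlignSet ea, AlignSet eb, unsigned n)
{
    assert(n >= 1 && n <= 32);
    AlignSet result = 0;
    for (unsigned a = 0; a < n; a++) {
        if (((ea >> a) & 1u) == 0)
            continue;
        for (unsigned b = 0; b < n; b++) {
            if ((eb >> b) & 1u)
                result |= (AlignSet)1 << op(a, b, n);
        }
    }
    return result;
}
```

As a usage example with N = 4: for a word aligned pointer p (E = {0}) and an unknown index i scaled by an element size of 2 bytes, transfer(f_mul, {0,1,2,3}, {2}, 4) yields {0, 2}, and adding this to {0} with f_add keeps {0, 2}. The access *(p+i) is therefore not guaranteed to be word aligned.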
8.2.4 SIMD Analysis
The preparative loop transformations consist of strip mining, scalar expansion and loop unrolling.
They must be parameterized according to the underlying SIMD architecture. Incorrect parameters
might prevent SIMD optimization or lead to non-optimal results. The transformations often only
pay off if the SIMD optimization is applied afterwards. Therefore, it is important to apply them
only to the most promising loops for SIMD optimization. Hence, a SIMD analysis engine is
implemented that runs in advance to identify those loops which contain SIMD candidates. For
this purpose the SIMD candidate matcher is employed. If the loop body does not
contain any SIMD candidate, it does not make sense to consider it further. Otherwise, the analysis
uses the SIMD-set constructor to determine for each SIMD candidate how many of them would be
needed to build a SIMD-set that matches one of the available SIMD instructions. From this
information it derives the parameters for the different loop transformations.
8.2.5 Strip Mining and Loop Peeling
Many vectorizable loops cannot be directly optimized in case the iteration count is larger than the
number of SIMD candidates $k_s$ that fit into a SIMD-set $s$ for the vector operation. Strip mining
is a loop transformation that divides the loop into strips, where each strip is no longer than the
SIMD data path width [159]. Essentially, the loop is decomposed into two nested loops (Listing
8.1):
1. An outer loop (the strip loop) which steps between strips
2. An inner loop (the element loop) which steps between single iterations within a strip.
The SIMD analysis calculates the iteration count of the element loop, called the strip size, based
upon all SIMD-sets S that can be built with the identified SIMD candidates in the loop. Since it
might happen that each SIMD-set has a different number of sub-registers k, the maximum strip
size for the transformation is selected:
\[ \text{strip size} = \max\Bigl(\bigcup_{s \in S} k_s\Bigr) \tag{8.4} \]
However, due to possible alignment constraints of the SIMD architecture, strip mining must ensure
that each strip starts at an alignment boundary. Assuming that arrays are word aligned in memory,
the alignment boundaries are given by:
\[ \text{alignment boundaries} = \{\, i \mid i \bmod \text{strip size} = 0 \,\} \tag{8.5} \]
where $i$ is the loop counter. However, strip mining is performed in the iteration space. Thus, for
array references like [i + c] with c being a constant and $c \neq 0$, the alignment boundary for each
strip can differ from the real alignment in memory. Therefore, an offset can be set (if it remains
constant within the loop) to readjust the alignment boundaries defined in the iteration space so
that they correspond with the real alignment in memory. Consequently, the offset is always within
the range (−strip size, strip size). The alignment boundary is then given by:
\[ \text{alignment boundaries} = \{\, i \mid (i + \text{offset}) \bmod \text{strip size} = 0 \,\} \tag{8.6} \]
// original loop
for (i = iFrom; i < iTo; i++)
{
    A[i+c] = B[i+c] * C[i+c];
}
// strip loop
// strip_size = max. #sub-registers
for (is = iFrom; is < iTo; is += strip_size)
{
    // element loop
    for (i = is; i < is + strip_size; i++)
    {
        A[i+c] = B[i+c] * C[i+c];
    }
}
Listing 8.1: Strip Mining with offset = 0
// peeled iterations
for (i = iFrom;
     i < iFrom + mod(-(iFrom + offset), strip_size);
     i++)
{
    A[i+c] = B[i+c] * C[i+c];
}
// strip mined loop
for (is = iFrom + mod(-(iFrom + offset), strip_size);
     is < iTo - mod(iTo + offset, strip_size);
     is += strip_size)
{
    for (i = is; i < is + strip_size; i += 1)
    {
        A[i+c] = B[i+c] * C[i+c];
    }
}
Listing 8.2: Strip Mining with offset != 0
The boundary information can be easily computed using the information from the alignment
analysis. If the offset remains constant within the loop it can be eliminated by loop peeling. That
means, those iterations causing the misalignment are “peeled off” the original loop and build a
loop on their own (Listing 8.2). Note that the modulo operation must produce a value in the
range [0, strip size). Furthermore, it must take care of overflows that might occur during the
computation of the loop boundaries.
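The requirement on the modulo operation matters in C, where the % operator may return a negative value for a negative left operand. The following sketch shows one possible host-side version of the transformation of Listing 8.2. The helper mod_floor, the function strip_mined_mul and the trailing remainder loop (for iterations after the last complete strip) are additions assumed for this example; overflow handling for the loop bounds is deliberately not shown.

```c
#include <assert.h>

/* Modulo with a result in [0, m), also for negative a. */
static int mod_floor(int a, int m)
{
    assert(m > 0);
    int r = a % m;
    return r < 0 ? r + m : r;
}

/* A[i+c] = B[i+c] * C[i+c] for i in [i_from, i_to), restructured into
   peeled iterations, aligned strips and a remainder loop.
   Returns the number of complete strips, i.e. the SIMD opportunities. */
static int strip_mined_mul(short *A, const short *B, const short *C,
                           int i_from, int i_to, int c,
                           int offset, int strip_size)
{
    int strips = 0;
    int i, is;
    int start = i_from + mod_floor(-(i_from + offset), strip_size);
    int end = i_to - mod_floor(i_to + offset, strip_size);

    if (start > i_to) start = i_to;   /* no alignment boundary inside the loop */
    if (end < start) end = start;

    for (i = i_from; i < start; i++)               /* peeled iterations */
        A[i + c] = (short)(B[i + c] * C[i + c]);

    for (is = start; is < end; is += strip_size) { /* strip loop */
        for (i = is; i < is + strip_size; i++)     /* element loop */
            A[i + c] = (short)(B[i + c] * C[i + c]);
        strips++;
    }

    for (i = end; i < i_to; i++)                   /* remainder iterations */
        A[i + c] = (short)(B[i + c] * C[i + c]);

    return strips;
}
```

For example, with i_from = 1, i_to = 11, offset = 0 and strip_size = 4, the iterations 1 to 3 are peeled, one complete strip covers 4 to 7, and the iterations 8 to 10 remain for the scalar remainder loop.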
8.2.6 Scalar Expansion
When scalars are assigned and later used in the loop, the dependency graph will include flow
dependence relations from the assignment to each use and loop carried anti-dependencies from
each use back to the assignment. These anti-dependence relations often cause problems in other
transformations and could prevent parallelization of the loop (Listing 8.3). However, the anti-
dependence relation can be broken by scalar expansion [159]. The basic idea is to allocate an array
with one element for each iteration and replace each scalar reference in the loop with a reference to
the array. This eliminates the anti-dependence relations. The computed value should be assigned
to the original scalar after the loop (Listing 8.4). One obvious drawback of scalar expansion,
though, is the increased memory consumption of the program. If not carefully managed, this
penalty can outweigh the benefits gained by SIMD. For instance, the memory usage can be
reduced by strip mining the loop and only expanding the inner element loop.
for (i = 0; i < N; i++)
{
    s = B[i] * C[i];
    A[i] = s + 1/s;
}
Listing 8.3: Scalar s causes anti-dependence
for (i = 0; i < N; i++)
{
    S[i] = B[i] * C[i];
    A[i] = S[i] + 1/S[i];
}
s = S[N-1];
Listing 8.4: Replaced scalar with array access
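The memory saving obtained by combining strip mining with scalar expansion can be sketched as follows: only the element loop is expanded, so the temporary array needs STRIP_SIZE elements instead of one per iteration. The function name expand_in_strips, the constant STRIP_SIZE = 2 and the requirement that the iteration count is a multiple of the strip size are assumptions of this example. The loop body follows Listing 8.3, so the products must be non-zero.

```c
#include <assert.h>

#define STRIP_SIZE 2  /* assumed number of sub-registers */

/* Scalar expansion restricted to the element loop: the scalar s of
   Listing 8.3 becomes a STRIP_SIZE-element array instead of an n-element
   one. Returns the value that s holds after the loop. */
static int expand_in_strips(int *A, const int *B, const int *C, int n)
{
    int S[STRIP_SIZE];

    assert(n > 0 && n % STRIP_SIZE == 0);
    for (int is = 0; is < n; is += STRIP_SIZE) {    /* strip loop */
        for (int i = 0; i < STRIP_SIZE; i++) {      /* element loop */
            S[i] = B[is + i] * C[is + i];
            assert(S[i] != 0);                      /* 1/s must be defined */
            A[is + i] = S[i] + 1 / S[i];
        }
    }
    return S[STRIP_SIZE - 1];   /* value of the last iteration */
}
```

Within one strip there is no anti-dependence between the iterations any more, while the additional storage stays constant regardless of the trip count.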
8.2.7 The Vectorizer
A classical vectorizer parallelizes the whole loop at once provided that suitable SIMD instructions
are available for all statements in the loop body and no data dependencies limit parallelization.
Another prerequisite is that the iteration count must match the number of SIMD candidates
needed to build the SIMD-set for the vector operation. Obviously, this is a perfect match for
strip mined loops. The vectorization algorithm is exemplified in Figure 8.4. In the first step (1), it
checks for all inner loops whether each statement consists only of SIMD candidates, using the SIMD
candidate matcher. In step (2), it virtually duplicates the SIMD candidates according to the
iteration count of the current loop. For these virtual SIMD candidates it then tries to construct a
SIMD-set that matches an available SIMD instruction with the SIMD-set constructor (3). Finally,
if valid SIMD-sets can be constructed for each statement then the whole loop will be replaced by
the corresponding SIMD instructions (4).
[Figure: (1) Check loop statements: the element loop with strip_size = 2,
    for (i = is; i < is + 2; i++) { A[i] = B[i] * C[i]; }
consists only of SIMD candidates (the "=" and "*" operations on A[i], B[i], C[i]).
(2) Virtually duplicate the SIMD candidates according to the iteration count.
(3) Construct SIMD-sets that match the available vector instructions SIMD_mul_2x16(x,y) and SIMD_store_2x16(x,y).
(4) Replace the loop. After vectorization:
    int* tmp1=(int*)A;
    int* tmp2=(int*)B;
    int* tmp3=(int*)C;
    *tmp1 = SIMD_mul_2x16(*tmp2,*tmp3);]
Figure 8.4: Vectorization example
Of course, it might happen that not all loop statements can be directly parallelized, e.g. due to
data dependencies. Nevertheless, they may still contain a certain degree of parallelism. Therefore, loops
which could not be vectorized are further processed by the more powerful unroll-and-pack based
SIMDfyer.
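The result of step (4) can be emulated on a host machine as shown below. This is a sketch under assumptions: SIMD_mul_2x16 is modeled as two independent 16-bit multiplications whose results are truncated to 16 bits, the function element_loop_vectorized is a hypothetical wrapper, and memcpy stands in for the 32-bit SIMD load and store so that the example does not rely on pointer casts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed CKF semantics: two independent 16-bit multiplications,
   each result truncated to 16 bits. */
static uint32_t SIMD_mul_2x16(uint32_t x, uint32_t y)
{
    uint32_t lo = ((x & 0xFFFFu) * (y & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((x >> 16) * (y >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Vectorized form of the element loop
   for (i = is; i < is + 2; i++) A[i] = B[i] * C[i]; */
static void element_loop_vectorized(int16_t *A, const int16_t *B,
                                    const int16_t *C, int is)
{
    uint32_t a, b, c;

    assert(is >= 0);
    memcpy(&b, &B[is], sizeof b);   /* 32-bit load of B[is], B[is+1]  */
    memcpy(&c, &C[is], sizeof c);   /* 32-bit load of C[is], C[is+1]  */
    a = SIMD_mul_2x16(b, c);
    memcpy(&A[is], &a, sizeof a);   /* 32-bit store to A[is], A[is+1] */
}
```

Since all three operands are loaded and stored with the same lane order, the emulation gives the same result on little-endian and big-endian hosts.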
8.2.8 Loop Unrolling
The SIMDfyer implements a technique similar to [217]. This requires loops to be unrolled properly
to ensure full utilization of the SIMD data path. The SIMD analysis customizes the unroll factor
to the number of SIMD candidates $k_s$ that fit into a SIMD-set $s$ that can be constructed for the
given loop body. This is basically the same as for the strip size calculation. Consequently, strip
mined loops will be unrolled completely if they are not vectorized. It may happen that the loop
contains several SIMD candidates which can be combined in different ways into a SIMD-set. Thus,
since it is desired to fill all possible SIMD-sets $S$, the best unroll factor can be calculated as:
\[ \text{unroll factor} = \max\Bigl(\bigcup_{s \in S} k_s\Bigr) \tag{8.7} \]
The SIMD analysis annotates the unroll factor to each loop that contains SIMD candidates. The
value of all loops left after vectorization will be read by the Loop Unroller to prepare them for the
SIMDfyer.
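Equations 8.4 and 8.7 both reduce to taking the largest sub-register count over all SIMD-sets that can be built for a loop. A minimal sketch, assuming the analysis has already collected the values k_s in an array (the function name max_sub_registers is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Largest sub-register count k_s over all SIMD-sets s that can be built
   for a loop body; used as strip size and as unroll factor. */
static int max_sub_registers(const int *k, size_t num_sets)
{
    assert(k != NULL && num_sets > 0);
    int best = k[0];
    for (size_t s = 1; s < num_sets; s++) {
        if (k[s] > best)
            best = k[s];
    }
    return best;
}
```

For a loop in which a 2 x 16-bit and a 4 x 8-bit SIMD-set can be formed, the annotated unroll factor would be 4.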
8.2.9 The Unroll-and-Pack based SIMDfyer
For a given IR of an input C program, an iterative algorithm is used that combines SIMD candidates
into SIMD-sets and replaces such sets by CKFs in the IR [46]. Even though the algorithm
could in principle process all basic blocks inside a procedure, it focuses only on the loops, typically
the hot spots of the input program, and more specifically only on those loops in which the SIMD
analysis has identified SIMD candidates. Certain multiple basic block constructs, though, may have been
merged into a single basic block by an if-conversion [111] pass prior to the SIMD optimization.
The algorithm forms SIMD instructions step by step. If a complete SIMD-set could be built it will
be replaced by the corresponding CKF. Since each iteration may generate new SIMD candidates,
the list of SIMD candidates is updated after each step. The identification of SIMD candidates
is performed by the SIMD candidate matcher. The basic idea of the iteration is illustrated in
Figure 8.5.
[Figure: IR states (1) to (3) for the statements A[i] = B[i] * C[i] and A[i+1] = B[i+1] * C[i+1]. (1) Two scalar "=" / "*" trees, marked as SIMD candidates. (2) A SIMD mul on the packed operands (B[i], B[i+1]) and (C[i], C[i+1]), with Extract operations feeding the two scalar "=" stores to A[i] and A[i+1]. (3) A SIMD store of the SIMD mul result to (A[i], A[i+1]). Available SIMD instructions: SIMD_mul_2x16(x,y), SIMD_store_2x16(x,y)]
Figure 8.5: IR states in different iterations
State (1) shows the initial IR structure for a sample loop body (unrolled twice) that performs a
multiplication of two vectors B, C and stores the result in vector A. The left and right elements
of the computations are isomorphic and are assumed to meet the memory alignment constraints.
Firstly, the algorithm combines the left and the right operands (16-bit load operations) of the two
“*” to 32-bit SIMD load operations. Afterwards, the “*” operations themselves are combined to
a SIMD instruction. The corresponding IR has the intermediate state (2). In order to preserve
the semantic correctness, explicit “extract” operations are inserted that select 16-bit sub-words
out of the 32-bit result of the SIMD dual multiplication operation. These extracts are also
considered as SIMD candidates and hence, can also be used to build a SIMD-set. Note, all
superfluous extracts are removed by dead code elimination in a later compilation phase. In the
following iteration, the two 16-bit “=” operations form a SIMD-set on their own. Finally, the IR
state (3) is reached and the algorithm terminates.
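The transition from state (1) to state (2) can be mimicked in plain C. The following is only a minimal sketch: the helper names and the lane layout (sub-word 1 in the low half) are assumptions made for illustration, and the dual multiply is emulated in software rather than mapped to a real instruction.

```c
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit values into one 32-bit word.
 * Assumed layout: sub-word 1 in the low half, sub-word 2 in the high half. */
static uint32_t pack_2x16(int16_t first, int16_t second) {
    return (uint32_t)(uint16_t)first | ((uint32_t)(uint16_t)second << 16);
}

/* Explicit "extract" operations selecting one 16-bit sub-word.
 * The conversion back to int16_t wraps on mainstream compilers
 * (implementation-defined in older C standards). */
static int16_t extract_1_of_2(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t extract_2_of_2(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Software stand-in for a dual 16-bit multiply: each lane keeps the
 * low 16 bits of its product, as a 16-bit scalar multiply would. */
static uint32_t simd_mul_2x16(uint32_t x, uint32_t y) {
    uint16_t lo = (uint16_t)((uint32_t)(uint16_t)x * (uint32_t)(uint16_t)y);
    uint16_t hi = (uint16_t)((x >> 16) * (y >> 16));
    return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

With B = (3, -4) and C = (5, 6) packed this way, the two extracts of the product word yield the same values as the two scalar multiplications of state (1).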
The presented algorithm composes a SIMD instruction from a set of SIMD candidates iteratively,
step by step. In this way, an exhaustive search within the given loop body is avoided. The
worst-case complexity is therefore only a low-degree polynomial, O(n^3), for n variable accesses
in the IR. Practical experience shows that this relatively simple heuristic
consumes only a few CPU seconds of compilation time while utilizing SIMD instructions very well
for speeding up common DSP code benchmarks. Due to the possible necessity of inserting extra
code for dynamic pointer alignment checks before loop entry points and the corresponding code
duplication, insertion of SIMD instructions may lead to an increase in code size.
void dotproduct(short *pa,
                short *pb,
                short *pc)
{
    short sum;
    short S[2];
    sum = 0;
    S[0] = S[1] = 0;
    for (int is = 0; is < 64; is += 2)
    {
        S[0] = S[0] + (*pa * *pb) * *pc;
        pa++; pb++; pc++;
        S[1] = S[1] + (*pa * *pb) * *pc;
        pa++; pb++; pc++;
    }
    sum = sum + S[0] + S[1];
}
Listing 8.5: Initial code
void dotproduct(short *pa, short *pb,
                short *pc)
{
    short sum;
    short S[2];
    int tmp1, tmp2;
    short res0, res1, res2, res3;
    sum = 0;
    S[0] = S[1] = 0;
    for (int is = 0; is < 64; is += 2)
    {
        tmp1 = (int*)pa; // SIMD load
        tmp2 = (int*)pb; // SIMD load
        res0 = EXTRACT_short_1_of_2(tmp1);
        res1 = EXTRACT_short_2_of_2(tmp1);
        res2 = EXTRACT_short_1_of_2(tmp2);
        res3 = EXTRACT_short_2_of_2(tmp2);
        S[0] = S[0] + (res0 * res2) * *pc;
        pa++; pb++; pc++;
        S[1] = S[1] + (res1 * res3) * *pc;
        pa++; pb++; pc++;
    }
    sum = sum + S[0] + S[1];
}
Listing 8.6: 1st iteration
8.2.10 Code Example
This section provides a more detailed example to illustrate the representation of SIMD instructions
in the IR. Listing 8.5 shows the initial C source code after preprocessing (strip mining, scalar
expansion and loop unrolling). Assuming the availability of SIMD instructions for addition and
multiplication operating on two 16-bit values, the SIMD analysis determines a strip size and
an unroll factor of two for the loop transformations. Here, scalar expansion is performed on the
element loop which is then fully unrolled afterwards. It is further assumed that the target machine
requires SIMD load operations to be word aligned.
In the first iteration, the SIMDfyer identifies that two pairs of 16-bit operands can be loaded at
once. Furthermore, the necessary sub-register extract functions (EXTRACT_short_x_of_2) are inserted,
and temporary variables for intermediate results are allocated. These extracts are needed to
preserve the semantic correctness of the code. Listing 8.6 depicts the code after the first iteration
(as generated by the IR-to-C code dump facility of the CoSy compiler platform).
void dotproduct(short *pa,
                short *pb,
                short *pc)
{
    short sum;
    short S[2];
    int tmp1, tmp2, tmp3;
    short res0, res1, res2, res3, res4, res5;
    sum = 0;
    S[0] = S[1] = 0;
    for (int is = 0; is < 64; is += 2)
    {
        tmp1 = (int*)pa; // SIMD load
        tmp2 = (int*)pb; // SIMD load
        res0 = EXTRACT_short_1_of_2(tmp1);
        res1 = EXTRACT_short_2_of_2(tmp1);
        res2 = EXTRACT_short_1_of_2(tmp2);
        res3 = EXTRACT_short_2_of_2(tmp2);
        tmp3 = SIMD_mul_2x16(tmp1, tmp2);
        res4 = EXTRACT_short_1_of_2(tmp3);
        res5 = EXTRACT_short_2_of_2(tmp3);
        S[0] = S[0] + res4 * *pc;
        pa++; pb++; pc++;
        S[1] = S[1] + res5 * *pc;
        pa++; pb++; pc++;
    }
    sum = sum + S[0] + S[1];
}
Listing 8.7: 2nd iteration
void dotproduct(short *pa, short *pb,
                short *pc)
{
    short sum;
    short S[2];
    sum = 0;
    S[0] = S[1] = 0;
    if (((pa | pb | pc) & 3) == 0)
    {
        for (int is = 0; is < 64; is += 2)
        {
            (int) S[0] =
                SIMD_add_2x16((int)S[0],
                    SIMD_mul_2x16(
                        SIMD_mul_2x16((int*)pa,
                                      (int*)pb),
                        (int*)pc));
            pa += 2; pb += 2; pc += 2;
        }
    } else {
        for (int is = 0; is < 64; is += 2)
        {
            S[0] = S[0] + (*pa * *pb) * *pc;
            pa++; pb++; pc++;
            S[1] = S[1] + (*pa * *pb) * *pc;
            pa++; pb++; pc++;
        }
    }
    sum = sum + S[0] + S[1];
}
Listing 8.8: Final code
In the next iteration the two multiplications are detected as SIMD candidates and are replaced
by a CKF (SIMD_mul_2x16). The SIMD multiplication imposes conditions on the sub-registers in
which the input operands must be located. Since the input operands are produced by the extract
operations from the previous iteration, these conditions can easily be met by directly using the
temporaries from which the input operands are extracted. This makes the extract operations
from the previous iteration superfluous. The resulting code is depicted in Listing 8.7.
Listing 8.8 shows the final code after several further steps. The SIMD-set computation has been
finalized by detecting that the multiply results can be processed further by SIMD additions.
No extract operations are required since the results can be directly written by a wide store
to the array created by scalar expansion. Here it is assumed that the alignment analysis can-
not resolve the alignment of the pointers, thus a dynamic alignment check has been inserted
(if(((pa|pb|pc) & 3) == 0)) to rule out misaligned pointers. If the check fails, a non-SIMD
version of the loop is executed in the else-branch. Finally, standard optimizations, such as dead
code elimination, have been invoked to remove superfluous operations (e.g. extracts) from previous
phases. The resulting code is passed to the compiler backend for assembly code generation.
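The pointer expression in Listing 8.8 is IR dump notation and is not valid C as written, since pointers cannot be combined with bitwise operators directly. A minimal sketch of the same dynamic alignment check in standard C might look as follows; the function name is hypothetical and word size is assumed to be 4 bytes, as in the text.

```c
#include <stdint.h>

/* Returns nonzero if all three pointers are aligned to a 4-byte word
 * boundary. OR-ing the addresses lets a single mask test cover all of
 * them: any set bit among the two lowest bits signals misalignment. */
static int all_word_aligned(const short *pa, const short *pb, const short *pc) {
    uintptr_t bits = (uintptr_t)pa | (uintptr_t)pb | (uintptr_t)pc;
    return (bits & 3u) == 0;
}
```

A caller would branch on this result once before the loop and select either the SIMD loop version or the scalar fallback, which is where the code duplication mentioned above comes from.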
8.3 Retargeting the SIMD Framework
Retargeting the SIMD framework requires essentially two pieces of information. The first is a
description of the IR tree patterns that represent a SIMD candidate; it is used to generate the
SIMD candidate matcher. The second is the SIMD-set construction, i.e. the specification of how
SIMD candidates can be composed into a valid SIMD-set.
8.3.1 SIMD Candidate Matcher
The identification of SIMD candidates can be implemented using the tree covering based code
selection [221]. SIMD candidates can be easily described by regular mapping rules. Normally, such
a rule describes how a certain IR operation is mapped to target assembly code. Nonterminals,
typically the rule operands, are used as “temporaries” to transfer values from one rule to another.
From this specification a tree pattern matcher for code selection can be generated with tools like
Burg [44]. In this approach the regular CoSy tree pattern matcher generator is utilized to create
a dedicated SIMD candidate matcher from SIMD candidate rules which are part of the regular
code selector description (see footnote 1). Such rules use special SIMD nonterminals containing two specific
attributes: a pos field for the sub-register number within a full register and an id to identify a
memory area, for example, allocated by a scalar variable or an array (Figure 8.6).
As will be explained later in more detail, the former is needed to check sub-register or alignment
constraints and the latter becomes important when the packed result of a SIMD operation is
directly consumed by another one. The initial values for these fields are already determined
by the prior dataflow/alignment analysis and are initialized when a load operation is matched.
Furthermore, each rule can be referenced using its unique rule name. Examples for two SIMD
candidate rules named load and add are shown in Listing 8.9 and 8.10.
Footnote 1: This is no contradiction to the limitations of tree pattern matching mentioned in Section 8.1. The matcher is
only employed to identify those IR operations which might be composed to a full SIMD operation; the complete
SIMD match cannot be found directly.
Declaration    Element    id    pos
short a[4];    1st        1     0
               2nd        1     1
               3rd        1     2
               4th        1     3
short b;       b          2     0
Figure 8.6: Pos/id for array/scalar variable
The 16-bit load rule initializes the SIMD nonterminal’s pos and id fields with the values deter-
mined by dataflow/alignment analysis. The produced SIMD nonterminal may then be consumed
by the add rule. Additional conditions can be used to select only those IR operators for a certain
data type or to specify constraints on the sub-register of the operands. In this example, the 16-bit
add rule matches only if both input operands are located in the same sub-register.
// Syntax is name:type
RULE [load] o:mirContent(src:reg_nt)
    -> dst:simd_nt;
CONDITION {
    IS_INT16(o)
}
EMIT {
    dst.pos = get_pos(o);
    dst.id = get_id(o);
}
Listing 8.9: SIMD candidate rule load
RULE [add] o:mirPlus(src1:simd_nt,
                     src2:simd_nt)
    -> dst:simd_nt;
CONDITION {
    IS_INT16(o) && src1.pos == src2.pos
}
EMIT {
    dst.pos = src1.pos;
    dst.id = newid(src1.id, src2.id);
}
Listing 8.10: SIMD candidate rule add
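To make the attribute handling of the two rules concrete, their CONDITION and EMIT parts can be sketched in plain C. This is only an illustration under assumptions: the function names match_load and match_add are hypothetical, and newid() is not defined in the text, so a simple deterministic combination of the two operand ids (assuming ids below 1000) stands in for it.

```c
#include <stdbool.h>

/* Attributes carried by a SIMD nonterminal, named as in the listings. */
struct simd_nt { int pos; int id; };

/* Hypothetical stand-in for newid(): any function that maps equal
 * operand id pairs to equal result ids would serve for this sketch. */
static int newid(int id1, int id2) { return id1 * 1000 + id2; }

/* Mirrors rule [load]: pos and id are taken over from the values
 * determined by the prior dataflow/alignment analysis. */
static struct simd_nt match_load(int pos, int id) {
    struct simd_nt dst = { pos, id };
    return dst;
}

/* Mirrors rule [add]: the rule matches only if both operands are
 * located in the same sub-register; the result inherits that pos. */
static bool match_add(struct simd_nt src1, struct simd_nt src2,
                      struct simd_nt *dst) {
    if (src1.pos != src2.pos)
        return false;
    dst->pos = src1.pos;
    dst->id = newid(src1.id, src2.id);
    return true;
}
```

Two additions whose operands come from the same pair of memory areas thus receive the same result id, which is what later allows them to be grouped.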
Additionally, rules to extract a sub-register from a full register must be created as well. Those
are used to match the extract operations (see Section 8.2.10) inserted in previous iterations of the
algorithm. In this way they become SIMD candidates in the current iteration. All extract rules
produce a SIMD nonterminal which sets id to the id of the temporary the result is extracted from
and the pos field to the position of the extracted sub-register respectively (Figure 8.7).
The SIMD candidate matcher’s flexibility is limited only by the capabilities of the underlying tree
pattern matcher generator. Since the concepts are already supported by the existing code selector
description, only minimal changes to the retargetable compiler platform are required. Since tree
covering based code selection is the state of the art in compiler design, this part can also be easily
ported to other platforms.
Figure 8.7: Pos/Id for extract operation
(IR of state (2) annotated with id/pos pairs: tmp = SIMD_mul over (B[i], B[i+1]) and (C[i], C[i+1]), followed by Extract 1 of 2 and Extract 2 of 2 feeding the assignments to A[i] and A[i+1]. The pairs shown are id = 2 and id = 3 with pos = 0, 1, id = 1 with pos = 0, 1, and id = 4 with pos = 0, 1.)
8.3.2 SIMD-Set Constructor
Special SIMD rules describe valid tuples N = (n1, . . . , nk) of SIMD candidates, where k denotes
the number of sub-registers. In contrast to regular mapping rules, they take the names of SIMD
candidate rules instead of nonterminals as input operands, i.e. a node ni corresponds to a SIMD
candidate rule name. The examples in Listing 8.11 and 8.12 specify a twofold 16-bit load and add
SIMD instruction, using the SIMD candidate rules from Listing 8.9 and 8.10.
SIMD RULE simd_load(a:load, b:load);
COMPOSITION
    CKF#1 (src:a.src)
    -> dst:reg_nt(a.dst, b.dst);
EMIT {
    printf("LOAD32 [%s] -> %s",
           REGNAME(src), REGNAME(dst));
}
Listing 8.11: SIMD rule twofold 16-bit load
SIMD RULE simd_add_2x16(a:add, b:add);
COMPOSITION
    CKF#2 (arg1:reg_nt(a.src1, b.src1),
           arg2:reg_nt(a.src2, b.src2)
          ) -> dst:reg_nt(a.dst, b.dst);
EMIT {
    printf("\tDUALADD16\t%s,%s -> %s",
           REGNAME(arg1), REGNAME(arg2),
           REGNAME(dst));
}
Listing 8.12: SIMD rule dual 16-bit add
Given the set of all identified SIMD candidates C = {c1, c2, . . . }, the set of all possible SIMD-sets
S is given by S ⊆ P(C) whereas each tuple in S must be in the set of all SIMD rules R as defined
in the compiler configuration. Furthermore, it must match certain implicit conditions. Let Pos(c)
denote the pos value of the result SIMD nonterminal produced by SIMD candidate rule c and
Id(c) the id respectively. Then the set of valid SIMD-sets S is given by:
$$S = \{(c_1, \ldots, c_k) \mid (c_1, \ldots, c_k) \in R \;\wedge\; Id(c_i) = Id(c_j) \;\wedge\; Pos(c_{l+1}) = Pos(c_l) + 1,\ \forall i, j \in (1, \ldots, k),\ l \in (1, \ldots, k-1)\} \tag{8.8}$$
In other words, the SIMD candidates of a valid SIMD-set must have the same id as well as an
increasing pos value assigned.
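The implicit conditions of Equation 8.8 translate into a short check over a tuple of candidates. The sketch below covers only the Id and Pos conditions; membership of the tuple in the rule set R is assumed to be tested separately, and the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Result attributes of one SIMD candidate: Pos(c) and Id(c). */
struct candidate { int pos; int id; };

/* Checks the implicit conditions for a tuple (c_1, ..., c_k):
 * all ids are equal and pos increases by exactly one from each
 * element to the next. Comparing neighbours is sufficient, since
 * pairwise-equal neighbours imply that all ids are equal. */
static bool satisfies_implicit_conditions(const struct candidate *c, size_t k) {
    if (k == 0)
        return false;
    for (size_t l = 0; l + 1 < k; ++l) {
        if (c[l + 1].id != c[l].id)
            return false;
        if (c[l + 1].pos != c[l].pos + 1)
            return false;
    }
    return true;
}
```

For k = 2, the pair (pos=0, id=1), (pos=1, id=1) passes, while a pair with differing ids or with the positions swapped is rejected.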
Consider the example shown in Listing 8.13. In the first iteration, the load rule covers the array
accesses and initializes the id with a unique number and the pos field with the position relative
to the SIMD load memory boundary. Note that accesses to the same array always get the same id
assigned; only the pos field varies. It is assumed that the arrays are aligned to a word boundary.
Now, due to the implicit condition of the SIMD load, the only way to create a complete SIMD-set is
to combine two adjacent loads (i.e. increasing pos) with the same id. All other combinations would
violate at least one constraint. Both SIMD loads create a temporary with a new id. Afterwards,
the operations to extract the sub-registers are inserted as well. As mentioned above, the
extracts also create new temporaries, which are assigned the same id as the temporary the sub-register
is extracted from, while the pos field is set to the extracted sub-register number.
for (i = 0; i < 64; i += 2)
{
    // <pos=0,id=1> <pos=0,id=2>
    a[i] = b[i] + c[i];
    // <pos=1,id=1> <pos=1,id=2>
    a[i+1] = b[i+1] + c[i+1];
    // <pos=0,id=3> <pos=0,id=4>
    x[i] = y[i] + z[i];
    // <pos=1,id=3> <pos=1,id=4>
    x[i+1] = y[i+1] + z[i+1];
}
// In the 1st iteration:
// load -> <pos=0,id=1>, ...
// SIMD_load(<pos=0,id=1>,<pos=1,id=1>)
//   -> <pos=0,id=5>
// SIMD_load(<pos=0,id=2>,<pos=1,id=2>)
//   -> <pos=0,id=6>
// EXTRACT_short_1_of_2(<pos=0,id=5>)
//   -> <pos=0,id=5>
// EXTRACT_short_2_of_2(<pos=1,id=5>)
//   -> <pos=1,id=5>
// EXTRACT_short_1_of_2(<pos=0,id=6>)
//   -> <pos=0,id=6>
// EXTRACT_short_2_of_2(<pos=1,id=6>)
//   -> <pos=1,id=6>
// ...
Listing 8.13: pos/id in 1st iteration
for (i = 0; i < 64; i += 2)
{
    // <pos=0,id=5>
    tmp1 = (int*)(b+i);
    // <pos=0,id=5>
    res0 = EXTRACT_short_1_of_2(tmp1);
    // <pos=1,id=5>
    res1 = EXTRACT_short_2_of_2(tmp1);
    // <pos=0,id=6>
    tmp2 = (int*)(c+i);
    // <pos=0,id=6>
    res2 = EXTRACT_short_1_of_2(tmp2);
    // <pos=1,id=6>
    res3 = EXTRACT_short_2_of_2(tmp2);
    ...
    // <pos=0,id=5> <pos=0,id=6>
    a[i] = res0 + res2;
    // <pos=1,id=5> <pos=1,id=6>
    a[i+1] = res1 + res3;
    ...
}
// In the 2nd iteration:
// add(<pos=0,id=5>,<pos=0,id=6>)
//   -> <pos=0,id=56>
// add(<pos=1,id=5>,<pos=1,id=6>)
//   -> <pos=1,id=56>
// SIMD_add(<pos=0,id=56>,<pos=1,id=56>)
// ...
Listing 8.14: pos/id in 2nd iteration
Thus, in the next iteration (Listing 8.14) the first and second operand of the first two additions
share the same ids. Consequently, the same id is generated for both results of the additions.
Now they can be combined to a SIMD add. The implicit id condition enforces that the
packed operands of the previous SIMD load are directly reused; otherwise, an expensive
repacking of the operands might be required, for instance if the first addition were combined
with the fourth addition. Note that it is also possible to specify an explicit condition for the
SIMD rules to override the defaults for pos and id. As an example, the conditions on the pos
fields can be used to model unaligned SIMD memory operations.
In order to complete the retargetable compilation flow, the CKF calls in the resulting intermediate
code must be replaced by valid assembly instructions for the target processor. In this framework,
the COMPOSITION for a SIMD rule specifies the CKF call which is internally generated for an
identified SIMD-set. It consists of a unique CKF number, the argument(s) to be passed to
the CKF call and the assembly code that is finally emitted. For example, the COMPOSITION for
simd_add_2x16 describes that the arguments for the CKF call are register nonterminals which
contain the first and second operands of the combined add rules. From this specification, a regular
code selector rule matching the CKF with the given number and assembly syntax is automatically
generated (Listing 8.15) and becomes part of the regular backend code selector.
RULE [CKF#2] o:IR_FuncCall(arg1:reg_nt, arg2:reg_nt)
    -> dst:reg_nt;
CONDITION {
    CKF_Number(o) == CKF#2
}
EMIT {
    printf("\tDUALADD16\t%s,%s -> %s",
           REGNAME(arg1), REGNAME(arg2), REGNAME(dst));
}
Listing 8.15: Internally generated CKF rule for simd_add_2x16
Like for the SIMD candidate matcher, many concepts are already supported by the existing tree
pattern matcher generator. Thus, only a few changes are required to the existing generator to
support this approach.
As mentioned in Chapter 6, the Compiler Designer tool comprises techniques to generate mapping
rules automatically from the LISA model. Since the SIMD configuration is quite similar to a
regular code selector description, the Compiler Designer has been extended in order to specify and
generate rules for SIMD instructions, too. More specifically, the user creates the SIMD candidate
rules using the mapping dialog. In the next step the user can select those SIMD candidates which
build a SIMD-set and assign a proper assembly instruction. From this specification, a SIMD
enabled code selector description for the CoSy compiler platform is finally generated.
8.4 Experimental Results
For experimental evaluation SIMD-enabled C compilers have been created for the NXP TriMedia
processor [169] and the ARM11 [33]. The TriMedia compiler has been designed using the Compiler
Designer tool whereas the ARM11 compiler is a handcrafted CoSy compiler. Both architectures
support SIMD only for short (i.e. 8-bit and 16-bit) integer data types – which is quite common
for embedded processors. Obviously, this constraint has to be taken into account for benchmark
selection. Therefore, mostly benchmarks from the DSPStone benchmark suite [246] have been
selected and several additional kernels have been implemented, similar to those used in [104] [77]
[63]. Furthermore, additional results for the following more complex DSP algorithms are provided:
quantize: matrix quantization with rounding
compress: discrete cosine transformation to compress a 128 x 128 pixel image by a factor of 4:1, block size of 8 x 8
idct_8x8: IEEE-1180 compliant inverse discrete cosine transformation
viterbi: GSM full rate convolutional decoder
emboss: converts an image using an emboss filter
sobel: applies a Sobel filter to an image
corr_gen: generalized correlation with a 1 by M tap filter
For the TriMedia and ARM compilers, the required retargeting effort for SIMD support was one
day for each. A similar workload can be expected for other processors, depending on architecture
features.
Regarding the SIMD architecture, the TriMedia is a 5-slot VLIW DSP with 128 general
purpose registers and a number of SIMD instructions. Due to its VLIW architecture, using
SIMD instructions does not lead to a speedup in all cases. For instance, one can issue 5
parallel ADD instructions simultaneously, while only 2 dual-ADD SIMD instructions can be
issued at a time. Furthermore, SIMD instructions may have a higher latency than regular
instructions (e.g. one cycle for an ADD vs. two cycles for a dual-ADD). So, unless the instruction
scheduler is able to find suitable instructions for filling the VLIW slots saved by SIMD,
no speedup can be expected. However, if the memory is the bottleneck (at most two parallel
LOADs/STOREs), SIMD instructions still help to reduce the memory pressure. There are also
further effects, due to the C coding style or register allocation effects in the compiler backend,
that lead to deviations from the theoretical speedup factor k in case of k sub-registers. The
memory is organized in 32-bit words, hence word alignment is required for SIMD memory accesses.
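The issue-slot argument can be made concrete with a small calculation. The sketch below is a deliberately simplified model that uses only the two limits quoted above (5 scalar ADDs or 2 dual-ADDs per cycle); it ignores latencies, data dependencies and all other instructions, and the function names are hypothetical.

```c
/* Integer division rounding up. */
static unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

/* Issue cycles for n independent 16-bit additions using scalar ADDs,
 * of which 5 can be issued per cycle. */
static unsigned scalar_issue_cycles(unsigned n_adds) {
    return ceil_div(n_adds, 5);
}

/* Issue cycles for the same work using dual-ADDs: each instruction
 * performs 2 additions, and only 2 can be issued per cycle. */
static unsigned simd_issue_cycles(unsigned n_adds) {
    return ceil_div(ceil_div(n_adds, 2), 2);
}
```

For 20 independent additions this model gives 4 issue cycles for the scalar version and 5 for the SIMD version, so SIMD only pays off here if the scheduler can use the freed slots for other work.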
In contrast, the ARM architecture is built around a central, scalar RISC core. It has a register
file which consists of 31 general purpose registers (at any one time only 16 registers are visible)
and 6 status registers. The memory is also organized in 32-bit words, and the same word
alignment as on the TriMedia is required for all memory accesses. The ARM11’s instruction set
supports only a limited set of SIMD instructions, consisting of additions and subtractions of byte
or half-word data values in 32-bit registers. Furthermore, the ARM features a complex dotproduct
support operation that multiplies two pairs of half-words in parallel and adds the two
resulting word-wide values to an accumulator. Since there is no direct SIMD multiplication
operation available, kernels that do not match this dotproduct support operation cannot be optimized.
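The semantics of the dotproduct support operation described above can be written down as a scalar C emulation. This is a sketch of the described behaviour only, not of a specific instruction encoding; the function name is hypothetical, signed half-words are assumed, and overflow of the accumulator is not handled.

```c
#include <stdint.h>

/* Emulates a dual multiply-accumulate: the low and high signed
 * half-words of x and y are multiplied pairwise, and both 32-bit
 * products are added to the accumulator. */
static int32_t dual_mac_2x16(uint32_t x, uint32_t y, int32_t acc) {
    int32_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int32_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int32_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int32_t y_hi = (int16_t)(uint16_t)(y >> 16);
    return acc + x_lo * y_lo + x_hi * y_hi;
}
```

One call covers two iterations of a loop of the form sum += a[i] * b[i], which is why a kernel has to have this multiply-then-accumulate shape to benefit on this target.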
The results are quantified first for one simple, particular benchmark, that is, a dotproduct, where
vector elements are accessed by means of array accesses in the C code:
for (i = 0; i < N; i++)
    sum += a[i] * b[i];
Listing 8.16: Dotproduct
Due to the dependency on sum, a scalar expansion has to be applied to the loop before SIMD
instructions can be inserted. First of all, the impact of the alignment analysis and the overhead
introduced by scalar expansion are investigated. Figure 8.8(a) shows the speedup over the number
of loop iterations I with and without alignment analysis. It can be clearly seen that a certain
iteration count is required to compensate for the overhead of scalar expansion before SIMD pays off.
Beyond that, the speedup is largely independent of I. For high iteration counts the speedup is
asymptotically 2, which corresponds to the theoretical speedup in this case. Obviously, the version
without the dynamic alignment check reaches the break-even point considerably faster than the
one with the checks. The extremely high speedup obtained on the ARM processor
is due to type conversions. Since the multiplications in the non-SIMD version produce 32-bit
results, these have to be converted to 16-bit precision afterwards. To achieve this, the ARM
compiler generates a logical left shift by 16 bits, followed by an arithmetic right shift back.
In the SIMD version, though, these steps are not necessary since the results of the
operations are already 16-bit values.
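The shift pair used for the 16-bit conversion can be reproduced in C as follows. This is a minimal sketch with a hypothetical function name; it relies on the right shift of a negative signed value being arithmetic and on the unsigned-to-signed conversion wrapping, which holds on mainstream compilers but is implementation-defined in older C standards.

```c
#include <stdint.h>

/* Truncates a 32-bit value to 16-bit precision and sign-extends it
 * again, using the two-instruction sequence described in the text. */
static int32_t narrow16_by_shifts(int32_t x) {
    uint32_t shifted = (uint32_t)x << 16;  /* logical left shift by 16 */
    return (int32_t)shifted >> 16;         /* arithmetic right shift back */
}
```

These two extra operations per multiplication result appear only in the scalar loop, which helps explain why the measured ARM speedup can exceed the theoretical factor of 2.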
The former two cases have demonstrated the dependence of the speedup on the iteration count.
Another interesting aspect is how the speedup develops with rising unroll factors (after SIMD
optimization). The example given in Figure 8.8(b) shows the progression for the dotproduct. The
number of iterations for this graph has been set to N = 128. As is apparent from Figure 8.8(a),
this is a number where the speedup is already very close to its peak value.
For the TriMedia, little difference is seen between the versions with and without dynamic
checks. The strong rise in speedup for the high unroll factors is due to the additional resource
pressure created by the large loop body. Since the VLIW architecture is inherently parallel, this
pressure is needed to completely saturate the CPU. The ARM’s progression, however, shows an
unexpected decline in performance for higher unroll factors. After close examination the cause has
been determined to be register shortage resulting in a considerable amount of spill code. Obviously,
the ARM greatly benefits from the removal of the dynamic check, since registers are freed and
thereby more degrees of freedom are left to the register allocator. The TriMedia processor with
its 128 available registers is not affected by this problem.
Figure 8.8: Speedup factors for benchmarks
(a) Speedup factor over loop iterations for dotproduct (x-axis: iterations, 2 to 1024; y-axis: speedup factor; series: TriMedia dynamic, TriMedia static, ARM dynamic, ARM static)
(b) Speedup factor over unroll factor for dotproduct (x-axis: unroll factor, U2 to U16; y-axis: speedup factor; same series)
Finally, Figure 8.9 summarizes the speedup results for all benchmarks. In the presence of dy-
namic alignment checks the SIMD loop version including the alignment check overhead has been
measured. A significant speedup was obtained in most cases. The speedup for the complex DSP
routines is generally lower, since a smaller fraction of the benchmark code can be mapped to
SIMD instructions than in the case of the DSPStone kernels. Still, a speedup of 7% up to 66%
was observed. In certain cases a super-linear speedup can be achieved for the ARM (e.g. 2.2 for
fir). This is related to the special multiply instructions of the ARM, which help to reduce the
overhead introduced by scalar expansion. On the other hand, for three benchmarks no speedup
could be obtained for the ARM due to the lack of a multiplication without accumulation.
Analogous to the speedup, a code size decrease by a factor of 0.6 on average can be observed, as
compared to the benchmarks without use of the SIMD engine (but with enabled loop unrolling),
and a code size increase by a factor of 1.5 for matrix1x3 which required a dynamic alignment
check.
Figure 8.9: Benchmark results
(Bar chart; x-axis: benchmark (vector addition, fir, n_real_updates, n_complex_updates, dot_product, matrix1, matrix3, quantize, compress, idct, viterbi, emboss, sobel, conv_3x3, corr_gen); y-axis: speedup factor, 0 to 2.5; series: TriMedia, ARM)
8.5 Conclusions
Almost all previous approaches to SIMD optimization are tailored to a specific target architecture.
This thesis presents a retargetable optimization framework for the class of processors with SIMD
support. The underlying concepts are proven by integrating the SIMD framework into the CoSy
platform that can be retargeted via the Compiler Designer GUI. In this way, SIMD-enabled
compilers for two realistic embedded processors were generated. The required retargeting effort is
quite limited for both compilers.
This results in a seamless and retargetable path from a single LISA model to a SIMD enabled
C compiler. While previous backend-oriented SIMD optimization techniques potentially lead to
higher code quality, significant speedup results for standard benchmarks were generally obtained
with this framework. Hence, the presented approach provides a good and practical compromise
between code efficiency and compiler flexibility.
The current implementation shows several limitations, whose elimination would probably lead to
higher code quality and would allow a wider range of loop constructs to be handled. As pointed out
in [5, 188, 63], SIMD optimization is often hindered by limitations of the SIMD memory unit
in combination with the memory access patterns in current applications. It is often necessary
to reorder the sub-registers, using special permute instructions before SIMD instructions can be
applied at all. So far, these instructions are rarely supported by embedded processors. However,
with the advances in semiconductor technology the SIMD data path width will increase in the
future and thus, it becomes more likely that next generation embedded processors will support
those. Therefore support for permutation seems to be a promising extension for the future.
Chapter 9
Predicated Execution
This chapter focuses on another class of target processors, namely those equipped with deep
pipelines and parallel functional units like VLIW architectures for instance. Such architectures
are quite popular in embedded system design since they do not require designs to sacrifice
software development productivity for the very high-performance processing needed for today’s
applications. Naturally, to achieve their peak performance all parallel functional units must
be kept busy during program execution. Thus, a common hardware feature to increase the
amount of available Instruction Level Parallelism (ILP) is Predicated Execution (PE). Basically,
this allows If-Then-Else (ITE) statements to be implemented without jump instructions, which offers
a number of optimization opportunities. Furthermore, PE can enable more aggressive compiler
optimizations which are often limited by control dependencies. For example, software pipelining,
which is crucial to achieve high performance for ILP processors, can be substantially improved
by PE [168]. However, this feature is by far not limited to highly parallel and deeply pipelined
processors. Even though less beneficial, single issue embedded processors like the ARM9 [33] or
configurable cores [36] are equipped with this feature, too. Clearly, support for PE in retargetable
compilers is of strong interest.
This chapter starts with looking at the issue for exploiting PE in ITE statements, before related
work is discussed in Section 9.2. Section 9.3 presents the optimization concepts. Afterwards,
Section 9.4 introduces the retargeting formalism and the code generation flow. Section 9.6 provides
experimental results for several embedded processors. Finally, this chapter is summarized and
some future work is discussed in Section 9.7.
9.1 Code Example
Predicated execution refers to the conditional execution of instructions based on the value of a
boolean source operand p. Irrespective of p’s value the instruction allocates the same processor
resources. In case p is false the computed result is ignored, i.e. it effectively behaves like a No-
Operation (NOP) instruction. Compilers utilize this to implement ITE statements without jump
instructions. As pointed out in [111], this can also be seen as converting control dependencies into
data dependencies, also referred to as if-conversion.
Figure 9.1: Implementation of an if-then-else statement with jump and conditional instructions
(C code: if (cond) { Then Block } else { Else Block }. Jump implementation: p = cond; [p] goto Then; Else Block; goto End; Then: Then Block; End:. PE implementation: p = cond; [p] Then Block; [!p] Else Block. Two schedules are shown for each implementation, with jump delay slots and empty slots marked: (1) PE advantageous, (2) PE causes overhead.)
Consider the example in Figure 9.1. The implementation on the right shows the common imple-
mentation of an ITE statement. It uses conditional jumps to model the control flow resulting from
the C code example on the left. The implementation with conditional instructions predicates the
then block with the result of the if-statement’s condition p and the else block with the negation
thereof respectively.
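At source level, the effect of if-conversion can be imitated as follows. This is only an illustrative sketch with hypothetical function names and arbitrarily chosen then/else computations: real predicated execution happens at instruction level, whereas here the predicate is modelled by a bit mask that selects one of two unconditionally computed results.

```c
/* Jump-based form: control flow decides which block is executed. */
static int ite_branch(int cond, int a, int b) {
    int r;
    if (cond)
        r = a + b;   /* then block */
    else
        r = a - b;   /* else block */
    return r;
}

/* If-converted form: both blocks are computed without any jump and
 * the predicate p only decides which result takes effect. */
static int ite_predicated(int cond, int a, int b) {
    int p = (cond != 0);     /* p = cond                          */
    int then_val = a + b;    /* [p]  then block                   */
    int else_val = a - b;    /* [!p] else block                   */
    int mask = -p;           /* all ones if p is 1, zero otherwise */
    return (then_val & mask) | (else_val & ~mask);
}
```

Both functions return the same value for every input; the second one trades the control dependency for a data dependency on p, at the price of always occupying resources for both blocks.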
Since jump instructions typically cause control hazards (cf. Section 3.3.4) the delay slots of the
jump instructions have to be filled with NOPs or with other useful instructions (in case there are
any). PE in contrast eliminates the control flow instructions which results in a single, but larger
basic block containing the still mutually exclusive then and else block. Larger basic blocks result
in more opportunities to exploit ILP. In the ideal case both blocks can be completely parallelized
on an ILP processor. Case (1) exemplifies this for a two issue slot processor. There are not enough
instructions to fill the delay slots in the jump implementation whereas the PE implementation
not only eliminates the delay slots, but also completely parallelizes the then and else block.
Unfortunately, if-conversion does not always pay off. It may also happen that, due to resource
conflicts during scheduling, the final schedule for the PE implementation has a larger length than
the implementation with jump instructions. Case (2) illustrates this. Here, there are few free
slots left in the then and else block and hence, there is almost no chance to parallelize them.
Consequently, the actual performance of both implementations always depends on the concrete
input program. Therefore, a precise cost computation is crucial to avoid a performance loss with
PE.
9.2 Related Work
Many compilation techniques for PE are based on the work by Mahlke et al. [220]. It describes
the formation of so-called hyperblocks, extended basic blocks that concurrently execute multiple
threads of conditional code. The decision whether to include a basic block in a hyperblock is
based on the criteria of execution frequency, block size and instruction characteristics. Since it
takes neither the degree of ILP nor the dependencies between different blocks into account,
scheduling for machines with few issue slots increased the resource interference and thus resulted
in performance degradation. August et al. [52] improved this work by allowing the scheduler
to revise decisions on hyperblock formation. But this leads to a complicated scheduler imple-
mentation. Additionally, it extends the previous work by partial if-conversion: in many cases,
including only a part of a path may be more beneficial than including or excluding the entire
path. Smelyanskiy et al. [156] try to solve the resource interference of Mahlke’s approach by a
technique called predicate-aware scheduling. However, they state that an architecture that sup-
ports their optimization proposal does not exist yet. All hyperblock-based approaches optimize
the average execution time.
The approach by Leupers [196] focuses especially on embedded processors and optimizes the worst-
case execution time. In contrast to the previous work it is capable of handling complete (possibly
nested) ITE statements with multiple basic blocks at a time. It has been selected as starting point
for this thesis to develop a retargetable PE optimization.
Hazelwood et al. [129] incorporated a lightweight if-conversion into a dynamic optimization
system. However, the overhead of such systems makes this work less suitable for embedded
processors. Chuang et al. [249] primarily target out-of-order architectures, which are rarely used
in the embedded domain. By combining control flow paths, PE introduces false dependencies
between instructions of disjoint paths. In [137] these dependencies are resolved by means of
predicated Static Single Assignment (SSA). The downside is a significantly increased code size and
a large number of required predicate registers; both are severe issues in the embedded domain.
Of the ASIP design platforms mentioned in Chapter 4, only Trimaran supports PE, but this
platform is limited to a narrow range of architectures. Quite recently, Target Compiler Technologies
announced support for PE, but nothing in this regard has been published yet. In the domain of
“general purpose” retargetable compilers, gcc [78] supports if-conversion, but gcc is generally
known to be difficult to adapt efficiently to embedded processor designs.
The aforementioned PE optimization techniques are mostly tailored to a certain target machine.
Hence, porting one of them to a new processor architecture is a tedious manual process.
Therefore, the implementation in this thesis focuses on an effective deployment of PE while achieving
retargetability for a wide variety of processors with PE support [148].
9.3 Optimization Algorithm
As already mentioned above, ITE statements can be implemented using conditional jumps or con-
ditional instructions. Another possibility is to implement only either the then or else block with
conditional instructions which is referred to as partial if-conversion. Furthermore, the concrete
implementation depends also on the nesting level of the ITE statement. The following section
introduces all possible ITE implementations, henceforth referred to as schemes. Section 9.3.3
concentrates on the cost computation of each scheme. Finally, Section 9.3.4 describes how the
best implementation is selected.
9.3.1 Implementation Schemes
In the following the infix INS denotes the implementation with conditional instructions and JMP
the implementation with conditional jumps. Furthermore, the prefix ITE stands for if-then-else
statements and IT for if-then statements. A suffix P indicates a scheme with precondition. The
notation [p] means that the following instruction or even a complete basic block is executed under
the condition stored in p. The schemes used in the example in Fig. 9.1 are depicted in Listing 9.1
and Listing 9.2.
p = R //store if−condition R
[p] goto L1 //cond. jump to Then
B E //else block
goto L2 //jump to end
L1: B T //then block
L2:
Listing 9.1: Scheme 1: ITEJMP
p = R //store if−condition R
q = !p //negate condition
[p] B T //cond. execute Then
[q] B E //cond. execute Else
Listing 9.2: Scheme 2: ITEINS
In case of a nested ITE statement, the execution of the then or else block of the nested statement
depends on p (the condition of the outer ITE statement) and on R’, which is the condition of the
nested statement itself. Hence, p constitutes the precondition for the nested ITE statement. The
corresponding schemes are shown in Listing 9.3 and Listing 9.4. Note that it is usually not possible
to attach multiple conditions to a single instruction. It is important that the precondition survives
the nested schemes, because subsequent instructions may also depend on it. Similar schemes are
obtained for IT statements (Listing 9.5 to Listing 9.8).
[p] c = R’//cond. store nested if−cond
q = !p //negate precondition
[q] c = 0
[c] goto L1 //cond. jump to Then
[p] X E //cond. exec. nested Else
goto L2 //jump to end
L1: X T //execute nested Then
L2:
Listing 9.3: Scheme 3: ITEJMPP
[p] c = R’//cond. store nested if−cond
d = !c //negate nested if−cond.
q = !p //negate precondition
[q] c = 0
[q] d = 0
[c] X T //cond. exec. nested Then
[d] X E //cond. exec. nested Else
Listing 9.4: Scheme 4: ITEINSP
p = !R
[p] goto L1
B T
L1:
Listing 9.5: Scheme 5: ITJMP
p = R
[p] B T
Listing 9.6: Scheme 6: ITINS
[p] c = !R’
q = !p
[q] c = 1
[c] goto L1
X T
L1:
Listing 9.7: Scheme 7: ITJMPP
[p] c = R’
q = !p
[q] c = 0
[c] X T
Listing 9.8: Scheme 8: ITINSP
p = R
[p] B T
[p] goto L1
B E
L1:
Listing 9.9: Scheme 9: ITETHEN
p = R
q = !p
[q] B E
[q] goto L1
B T
L1:
Listing 9.10: Scheme 10: ITEELSE
[p] c = R’
q = !p
[q] c = 0
[c] X T
[c] q = 1
[q] goto L1
X E
L1:
Listing 9.11: Scheme 11: ITETHENP
[p] c = R’
d = !c
q = !p
[q] d = 0
[d] X E
[d] q = 1
[q] goto L1
X T
L1:
Listing 9.12: Scheme 12: ITEELSEP
Of course, the presented schemes with the infix INS can only handle ITE statements whose then
and else blocks can be conditionally executed at all. If-conversion is hampered if a block contains
instructions that are not conditionally executable or if the then or else block has more than one
incoming control flow edge. By introducing new implementation schemes, such ITE statements
can be handled as well. The idea is to convert ITE statements partially by executing only one
block conditionally. This leads to the implementation schemes shown in Listing 9.9 to Listing
9.12.
For instance, if the else block prevents if-conversion for any of the above-mentioned reasons,
scheme ITETHEN can be applied. According to this scheme, the condition is computed in p and
the execution of the then block is predicated with it. If p is true, the else block must not be
executed and consequently, the conditional jump to the end block is taken. For nested
statements, additional code is needed to set the condition of the ITE statement at hand to
false in case the precondition is not fulfilled.
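The predicate logic of the partial schemes can be checked with a small simulation. The following Python sketch traces Listing 9.9 (ITETHEN) and Listing 9.11 (ITETHENP) and reports which blocks execute. It is only an illustration: the function names and the parameter stale_c (modelling the unknown old content of c when the guarded assignment is skipped) are assumptions and not part of the described implementation.

```python
def run_itethen(r):
    """Trace Listing 9.9 (ITETHEN): return the blocks that execute."""
    executed = []
    p = r                        # p = R
    if p:
        executed.append("B_T")   # [p] B_T
    if not p:                    # [p] goto L1 skips the else block
        executed.append("B_E")
    return executed


def run_itethenp(p, r_nested, stale_c=True):
    """Trace Listing 9.11 (ITETHENP) under the precondition p.

    stale_c models whatever c held before; it must not influence the result.
    """
    executed = []
    c = stale_c
    if p:
        c = r_nested             # [p] c = R'
    q = not p                    # q = !p
    if q:
        c = False                # [q] c = 0
    if c:
        executed.append("X_T")   # [c] X_T
        q = True                 # [c] q = 1
    if not q:                    # [q] goto L1 skips the nested else block
        executed.append("X_E")
    return executed
```

If the precondition is false, neither nested block runs regardless of stale_c, which is the purpose of the extra instruction that resets c.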
Note that all of the above schemes assume that the control flow from the if block either falls
through to the else block or conditionally jumps to the then block. In practice, the block layout
depends on the concrete application and the involved compiler optimizations. The block order
might also be the other way round, or the then and else blocks might not even follow the if
block directly, i.e. there is an explicit branch instruction to each block. Some of these cases
require slightly different schemes, which have been omitted here for the sake of brevity.
Furthermore, the implementation also depends on the support for negated conditions.
Some processors directly support negated predicates, others need to compute them explicitly.
The schemes shown here assume that negated predicates are not supported.
For each of these schemes the cost C, measured in instruction cycles, is computed. In the default
case, it is calculated as C = max(C_T, C_E), where C_T and C_E denote the execution time of
the ITE statement in case the then or else block gets executed, respectively. This corresponds
to the worst-case execution time of an ITE statement, which is a typical measure in the context of
embedded systems due to their real-time constraints. However, in certain cases it makes sense to
consider the average execution time of an ITE statement. As explained in the following,
it can be incorporated by using transition probabilities.
9.3.2 Probability Information
The examination of several control-intensive programs revealed that many ITE statements handle
errors in internal data structures or cope with wrong program inputs. During normal
program execution these cases are unlikely to occur. At the same time, such cases often
prevented if-conversion since the corresponding blocks dominated the worst-case execution time.
Another problem has been observed for ITE blocks of uneven length. As exemplified in Fig. 9.2,
suppose the else block is much shorter than the then block. Most likely, the instructions of the
else block will fit into free instruction slots of the then block, which consequently improves the
worst-case execution time. But if the execution frequency of the else block is higher than that
of the then block, applying if-conversion (ITEINS) results in a performance degradation in
more than 50% of all cases. Thus, converting the if-statement partially by executing only the
else block conditionally (ITEELSE) might be the better choice.
Figure 9.2: Then and else blocks of uneven length (schedules for ITEJMP, ITEELSE and ITEINS; legend: then, else, unused, jump, delay slot)
Therefore, it seems reasonable to give the programmer an opportunity to influence the cost
computation for each ITE statement.
A solution to this problem is to provide information on the execution probability of the then and
else blocks, which can be utilized in the cost computation later on. The value P(B_x) denotes the
probability of the transition from the if block (the block containing the condition) to the then
block B_T or the else block B_E, respectively. By definition, the probabilities sum to one:
P(B_T) + P(B_E) = 1.
CoSy annotates each basic block with a so-called use estimate, the estimated execution frequency.
These values are computed by a separate engine. Their main purpose is to improve the spill
heuristic of the register allocator, but they are evaluated in other optimizations as well. In this
context, these values can be used to derive the transition probabilities.
Three constellations of if-statements, as shown in Figure 9.3, must be considered. The graphs on
the left and in the middle are well-structured, but the right one is not due to the additional
control flow edge. In the following, E_x denotes the use estimate of either the if, the then or
the else block.
Figure 9.3: Different constellations of if-statements: (a) ITE statement, (b) IT statement, (c) ITE statement with additional incoming control flow edges
For the case in Figure 9.3(a), when the if block is executed, the control flow reaches either the
then or the else block. Moreover, the if block immediately dominates these blocks, i.e. there
exists no other path that can be taken to reach one of them (the if block is always executed
immediately before). Thus, the use estimates can be calculated as
\[ E_{if} = E_{then} + E_{else} \tag{9.1} \]
and the transition probabilities as
\[ P(B_T) = \frac{E_{then}}{E_{if}} \quad \text{and} \quad P(B_E) = \frac{E_{else}}{E_{if}} \tag{9.2} \]
The cases in Figure 9.3(b) and Figure 9.3(c) are slightly different, because there is no
immediate dominance relation as in the previous case. Considering Figure 9.3(b), the if block
immediately dominates only the then block. The else block is identical to the end block,
which obviously is not immediately dominated by the if block. Thus, the use estimates are given
by
\[ E_{if} \neq E_{then} + E_{else} = E_{then} + E_{end} \tag{9.3} \]
The formula to calculate P(B_T) still holds, and since the sum of the transition probabilities must
be one, this results in:
\[ P(B_T) = \frac{E_{then}}{E_{if}} \quad \text{and} \quad P(B_E) = 1 - P(B_T) \tag{9.4} \]
The last case is similar. Since the then block is not dominated by the if block, the inequality
\[ E_{if} \neq E_{then} + E_{else} \tag{9.5} \]
still holds and consequently,
\[ P(B_E) = \frac{E_{else}}{E_{if}} \quad \text{and} \quad P(B_T) = 1 - P(B_E) \tag{9.6} \]
Of course, this is a simple but not very precise way to determine probability information. More
accurate values can be obtained from profiling information, but this method may increase
compile time significantly. Another option is to annotate the probabilities directly to the
ITE statements themselves using pragmas. All
three kinds are supported by the implementation.
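The derivation of transition probabilities from use estimates (Equations 9.2, 9.4 and 9.6) can be sketched as follows. This is a minimal Python illustration; the function name and the string labels for the three constellations of Figure 9.3 are assumptions made for this example and do not reflect the CoSy interface.

```python
def transition_probabilities(e_if, e_then, e_else, kind):
    """Return (P(B_T), P(B_E)) from use estimates.

    kind selects the constellation of Figure 9.3:
      "ite"             (a) well-structured ITE statement, Eq. 9.2
      "it"              (b) IT statement, else block equals end block, Eq. 9.4
      "ite_extra_edges" (c) then block has additional incoming edges, Eq. 9.6
    """
    if e_if <= 0:
        raise ValueError("use estimate of the if block must be positive")
    if kind == "ite":
        p_then = e_then / e_if
        p_else = e_else / e_if
    elif kind == "it":
        p_then = e_then / e_if
        p_else = 1.0 - p_then      # E_else is not usable here
    elif kind == "ite_extra_edges":
        p_else = e_else / e_if
        p_then = 1.0 - p_else      # E_then is not usable here
    else:
        raise ValueError("unknown constellation: " + kind)
    return p_then, p_else
```

In case (b) the use estimate of the else block (which is the end block) is deliberately ignored, since it also counts executions that passed through the then block.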
9.3.3 Cost Computation
Naturally, the implementation schemes imply different execution times. The cost computation
annotates each ITE or IT statement with a cost table, which stores the corresponding
execution times for all schemes. The computation assumes that a conditional instruction
consumes the same resources regardless of whether its condition is true or false, and that both
cases have the same execution time. In the following, the superscript P denotes the presence of a
precondition. The branch instructions and the corresponding delay slots are distinguished as
J_Taken, a conditional branch that is taken, J_NotTaken, a conditional branch that is not taken,
and J_Always, an unconditional branch. For nested ITE statements the calculation starts with the
innermost statement and continues with the surrounding ITE statements.
The costs can be separated into two components: setup costs and cost values for the then and
else blocks. The former arise from extra instructions required for negating if-conditions or for
computing possible preconditions. Obviously, the setup costs depend on the given target architecture.
For example, some architectures support negated predicates, others need an extra instruction.
The costs for computing the ITE condition itself are not taken into account since they are incurred
by all schemes. Table 9.1 summarizes the setup costs for each scheme, assuming that the architecture
does not support negated conditions. For example, ITEJMP has no setup costs, whereas ITEINS
has a cost of one due to the additional instruction needed to negate the if-condition (see Listing
9.2).
Scheme      Setup costs
ITEJMP      S_1 = 0
ITEINS      S_2 = 1
ITEJMPP     S_3 = 2
ITEINSP     S_4 = 4
ITJMP       S_5 = 0
ITINS       S_6 = 0
ITJMPP      S_7 = 2
ITINSP      S_8 = 2
ITETHEN     S_9 = 0
ITEELSE     S_10 = 1
ITETHENP    S_11 = 3
ITEELSEP    S_12 = 4
Table 9.1: Setup costs according to the different implementation schemes.
The second component of the cost computation consists of the cost values for the then and else
blocks. A block is a sequence of statements (s_1, ..., s_n). The costs of a statement s_i are denoted
as C(s_i) if s_i is executed without a precondition and as C^P(s_i) if it is executed under a
precondition. If s_i is a simple statement, the costs are C(s_i) = C^P(s_i) = 1, but if s_i is an
ITE statement, the costs depend on the concrete implementation scheme. C(B_T), C(B_E), C^P(B_T)
and C^P(B_E) denote the execution times of a then and else block without and with precondition,
respectively. In case a scheme merges both blocks, the execution time for the joint execution is
denoted as C(B_T ◦ B_E). In prior work this value is modeled by a static formula which takes the
execution times of the individual blocks, the ILP degree and possible resource conflicts into
account. In some cases performance degrades due to inaccurate estimation. In order to obtain more
precise values, the cost computation is coupled to the scheduler. This process is split into two
phases. In the first phase, the schedules for the schemes with jump instructions are obtained, in
the second, those for the schemes using conditional instructions (see Section 9.5 for the detailed
code generation flow). The scheduler works only on basic block level. Hence, the statements
(s_1, ..., s_n) in a then and else block (or the merger of both) are grouped into the corresponding
basic blocks (G_1, ..., G_m). For each block G_i, the scheduler provides the number of cycles it
needs to execute, henceforth referred to as fillcycles F(G_i). Now, the costs for the blocks
(i.e. B_T, B_E, B_T ◦ B_E) are obtained as follows:
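\[
C(B) = \sum_{i=1}^{m} \left( F(G_i) +
\begin{cases}
\min\{C_1(s_{i-}), C_2(s_{i-}), C_9(s_{i-}), C_{10}(s_{i-})\} - F(G_i) & \text{if } s_{i-} \text{ is an ITE statement,} \\
\min\{C_5(s_{i-}), C_6(s_{i-})\} - F(G_i) & \text{if } s_{i-} \text{ is an IT statement,} \\
0 & \text{else.}
\end{cases}
\right) \tag{9.7}
\]
\[
C^P(B) = \sum_{i=1}^{m} \left( F(G_i) +
\begin{cases}
\min\{C_3(s_{i-}), C_4(s_{i-}), C_{11}(s_{i-}), C_{12}(s_{i-})\} - F(G_i) & \text{if } s_{i-} \text{ is an ITE statement,} \\
\min\{C_7(s_{i-}), C_8(s_{i-})\} - F(G_i) & \text{if } s_{i-} \text{ is an IT statement,} \\
0 & \text{else.}
\end{cases}
\right) \tag{9.8}
\]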
In case the last statement s_{i-} in a basic block is an IT or ITE statement¹, its costs have to be
taken into account as well. As can be seen later, these costs already contain the fillcycles of the
hosting basic block, thus they are subtracted again. In the first phase of the cost computation only
the cost values of the implementation schemes ITEJMP and ITJMP are available, hence the terms
\[ \min\{C_1(s_{i-}), C_2(s_{i-})\} \quad \text{and} \quad \min\{C_5(s_{i-}), C_6(s_{i-})\} \tag{9.9} \]
reduce to
\[ C_1(s_{i-}) \quad \text{and} \quad C_5(s_{i-}) \tag{9.10} \]
¹ Only the last statement in a basic block can be a control flow statement, cf. Section 3.3.1.
The cost for these two schemes can be calculated as follows:
\[
C_1(s_{i-}) = S_1 + F(G_i) +
\begin{cases}
C(B_T) + J_{Taken} & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
C(B_E) + J_{NotTaken} + J_{Always} & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
\max\{ C(B_T) + J_{Taken},\; C(B_E) + J_{NotTaken} + J_{Always} \} & \text{else.}
\end{cases}
\tag{9.11}
\]
\[
C_5(s_{i-}) = S_5 + F(G_i) +
\begin{cases}
J_{Taken} & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
C(B_T) + J_{NotTaken} & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
\max\{ J_{Taken},\; C(B_T) + J_{NotTaken} \} & \text{else.}
\end{cases}
\tag{9.12}
\]
For example, the costs for the scheme ITEJMP are composed of the setup cost S_1, the fillcycles of
the block containing the condition evaluation and an additional summand which depends on the
given transition probabilities. This summand is either the time for executing the then block plus
the jump delay of the conditional jump to reach it, or the time for the else block plus a not-taken
jump plus an unconditional jump, or the maximum (i.e. the worst case) of both. In order to provide
the possibility to switch off transition probabilities, a user-defined threshold p can be passed to
the cost computation. It is set to 1 by default, so that no probability exceeds it and the worst
case is used.
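Equation 9.11 can be read as the following Python sketch. The function and parameter names are chosen for this illustration only, and the numbers in the usage note are made up; in the described implementation the block costs and fillcycles come from the scheduler.

```python
def cost_itejmp(fill, c_then, c_else, p_then, p_else,
                j_taken, j_not_taken, j_always, setup=0, threshold=1.0):
    """Cost of scheme ITEJMP following Eq. 9.11.

    fill      -- fillcycles F(G_i) of the block evaluating the condition
    setup     -- setup cost S_1 (0 according to Table 9.1)
    threshold -- user-defined threshold p; the default 1.0 disables the
                 use of transition probabilities
    """
    via_then = c_then + j_taken
    via_else = c_else + j_not_taken + j_always
    if p_then > threshold and p_then > p_else:
        extra = via_then                 # then path is sufficiently likely
    elif p_else > threshold and p_else > p_then:
        extra = via_else                 # else path is sufficiently likely
    else:
        extra = max(via_then, via_else)  # worst case of both paths
    return setup + fill + extra
```

With fill=2, c_then=3, c_else=10, p_then=0.9, p_else=0.1, j_taken=2, j_not_taken=0 and j_always=2, the default threshold yields the worst-case cost 14, whereas a threshold of 0.8 lets the likely then path determine the cost, which gives 7.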
In the second phase the conditional schemes are computed as follows:
\[
C_2(s_{i-}) = S_2 + F(G_i) +
\begin{cases}
0 & \text{if } D = 1, \\
C^P(B_T) + C^P(B_E) & \text{else.}
\end{cases}
\tag{9.13}
\]
\[
C_3(s_{i-}) = S_3 + F(G_i) +
\begin{cases}
C(B_T) + J_{Taken} & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
C^P(B_E) + J_{NotTaken} + J_{Always} & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
\max\{ C(B_T) + J_{Taken},\; C^P(B_E) + J_{NotTaken} + J_{Always} \} & \text{else.}
\end{cases}
\tag{9.14}
\]
\[ C_4(s_{i-}) = S_4 + F(G_i) + C^P(B_T) + C^P(B_E) \tag{9.15} \]
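\[
C_6(s_{i-}) = S_6 + F(G_i) +
\begin{cases}
0 & \text{if } D = 1, \\
C^P(B_T) & \text{else.}
\end{cases}
\tag{9.16}
\]
\[
C_7(s_{i-}) = S_7 + F(G_i) +
\begin{cases}
J_{Taken} & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
C(B_T) + J_{NotTaken} & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
\max\{ J_{Taken},\; C(B_T) + J_{NotTaken} \} & \text{else.}
\end{cases}
\tag{9.17}
\]
\[ C_8(s_{i-}) = S_8 + F(G_i) + C^P(B_T) \tag{9.18} \]
\[
C_9(s_{i-}) = S_9 + F(G_i) + C^P(B_T) +
\begin{cases}
\Delta(J_{Taken}, B_T) & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
C(B_E) + \Delta(J_{NotTaken}, B_T) & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
\max\{ \Delta(J_{Taken}, B_T),\; C(B_E) + \Delta(J_{NotTaken}, B_T) \} & \text{else.}
\end{cases}
\tag{9.19}
\]
\[
C_{10}(s_{i-}) = S_{10} + F(G_i) + C^P(B_E) +
\begin{cases}
\Delta(J_{Taken}, B_E) & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
C(B_T) + \Delta(J_{NotTaken}, B_E) & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
\max\{ \Delta(J_{Taken}, B_E),\; C(B_T) + \Delta(J_{NotTaken}, B_E) \} & \text{else.}
\end{cases}
\tag{9.20}
\]
\[
C_{11}(s_{i-}) = S_{11} + F(G_i) + C^P(B_T) +
\begin{cases}
\Delta(J_{Taken}, B_T) & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
C(B_E) + \Delta(J_{NotTaken}, B_T) & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
\max\{ \Delta(J_{Taken}, B_T),\; C(B_E) + \Delta(J_{NotTaken}, B_T) \} & \text{else.}
\end{cases}
\tag{9.21}
\]
\[
C_{12}(s_{i-}) = S_{12} + F(G_i) + C^P(B_E) +
\begin{cases}
\Delta(J_{Taken}, B_E) & \text{if } P(B_E) > p \wedge P(B_E) > P(B_T), \\
C(B_T) + \Delta(J_{NotTaken}, B_E) & \text{if } P(B_T) > p \wedge P(B_T) > P(B_E), \\
\max\{ \Delta(J_{Taken}, B_E),\; C(B_T) + \Delta(J_{NotTaken}, B_E) \} & \text{else.}
\end{cases}
\tag{9.22}
\]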
The case differentiation in the formulas for C_2(s_{i-}) and C_6(s_{i-}) is actually not necessary,
because C^P(B_T) as well as C^P(B_E) are zero: the blocks were appended to the if block and thus
their costs are already contained in F(G_i). However, writing it this way makes explicit that this
is only the case if the depth D of the if-statement equals one, i.e. it is the innermost ITE statement.
This is mainly due to a restriction of the underlying CoSy framework. The ITE blocks cannot be
merged if D > 1, so their costs must be added explicitly.
Finally, all cost values are available and the best implementation schemes can be selected.
9.3.4 Selecting the best Scheme
Obviously, the decision to apply if-conversion depends on the corresponding costs, which in turn
depend on the execution times of nested ITE statements (bottom-up dependency). On the other
hand, the costs of a nested ITE statement depend on the presence or absence of a precondition,
which is determined by the implementation scheme of the surrounding ITE statement (top-down
dependency). Therefore, the best scheme cannot be determined in a single bottom-up or top-down
pass.
Figure 9.4: ITE tree, annotated cost tables and scheme selection (example: if (a > b) { x = a - b; } else { x = a + b; if (x == 13) { x = 0; } }; step 1: cost computation, step 2: scheme selection; each node carries a cost table over the schemes ITEJMP, ITEJMPP, ITEINS, ITEINSP, ...)
The search space is specified by an ITE tree T = (R, B_T, B_E). The root R is a boolean expression,
which is the condition of the ITE statement. The ITE blocks B_T and B_E correspond to the then
and else block. The scheme selection is based on a dynamic programming algorithm as presented
in [196]. This method is similar to the well-known tree pattern matching algorithm. It performs
two passes to select the right implementation scheme. In the first pass, all ITE trees are traversed
bottom-up, filling the cost tables for each node. The second pass is top-down. At the root node,
the scheme corresponding to the cheapest entry in the root’s cost table is selected.
Based on this selection it is known whether a precondition for the child node is present or not. This
determines the set of schemes (i.e. those with or without precondition) among which the cheapest
scheme for the child is selected, and so forth. This is illustrated in Figure 9.4.
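The two passes can be sketched in Python as follows. This is a deliberately simplified toy model and not the algorithm of [196]: only the four ITE schemes without partial conversion are considered, a statement has at most one nested statement (at the end of its else block), jump schemes are assumed to pass no precondition to the nested statement, and the flat JUMP_PENALTY as well as all class and function names are assumptions for illustration. The setup costs are taken from Table 9.1.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

PLAIN = ("ITEJMP", "ITEINS")      # schemes without precondition
PRECOND = ("ITEJMPP", "ITEINSP")  # schemes with precondition
SETUP = {"ITEJMP": 0, "ITEINS": 1, "ITEJMPP": 2, "ITEINSP": 4}  # Table 9.1
JUMP_PENALTY = 3                  # hypothetical flat penalty of a jump scheme


@dataclass
class IteNode:
    then_cost: int                 # cycles of the then block
    else_cost: int                 # cycles of the else block without nested statement
    merged_cost: int               # cycles of the merged blocks, as a scheduler might report
    nested_else: Optional["IteNode"] = None
    costs: Dict[str, int] = field(default_factory=dict)
    chosen: str = ""


def fill_cost_table(node):
    """Bottom-up pass: fill the cost table of every node, innermost first."""
    plain = guarded = 0
    if node.nested_else is not None:
        fill_cost_table(node.nested_else)
        plain = min(node.nested_else.costs[s] for s in PLAIN)
        guarded = min(node.nested_else.costs[s] for s in PRECOND)
    for scheme in PLAIN + PRECOND:
        if "JMP" in scheme:   # worst case of both paths plus the jump penalty
            body = max(node.then_cost, node.else_cost + plain) + JUMP_PENALTY
        else:                 # merged block; the nested statement gets a precondition
            body = node.merged_cost + guarded
        node.costs[scheme] = SETUP[scheme] + body


def select_schemes(node, precondition=False):
    """Top-down pass: pick the cheapest admissible scheme, then descend."""
    candidates = PRECOND if precondition else PLAIN
    node.chosen = min(candidates, key=lambda s: node.costs[s])
    if node.nested_else is not None:
        select_schemes(node.nested_else, precondition="INS" in node.chosen)


inner = IteNode(then_cost=1, else_cost=0, merged_cost=1)
outer = IteNode(then_cost=1, else_cost=1, merged_cost=1, nested_else=inner)
fill_cost_table(outer)
select_schemes(outer)
```

In this toy instance the outer statement keeps its jump implementation, so the nested statement is selected from the schemes without precondition. Both precondition variants are nevertheless computed in the bottom-up pass, because the choice of the surrounding statement is not yet known at that point.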
9.3.5 Splitting Mechanism
During benchmarking it turned out that for more complex programs only a small percentage of
the existing if-statements were processed at all, for various reasons: the cost computation
might decide against if-conversion, one ITE block might have multiple incoming control flow edges,
or one or both ITE blocks might contain hampering elements, e.g. non-predicable statements.
This is exemplified in Figure 9.5. The red lines indicate statements that are not conditionally
executable. Obviously, if-conversion cannot be applied to the ITE statement on the left. However,
assuming the statements B and C are independent of each other, the code depicted on the right can
be obtained. So far, only the statement level has been considered. Looking at the pseudo code level
(basically the assembly level representation of the source code), it can be observed that not all
instructions selected for such a statement are necessarily non-predicable. Consequently,
working on the pseudo code level allows a more fine-grained operation by moving single pseudo
code nodes. The basic idea is to move these nodes to the block containing the condition evaluation
of the remaining (non-predicated) ITE statement. Since this block typically contains only few
instructions, most likely not all delay slots of the conditional jump can be filled. Of course, this
has its limits: only as many nodes as there are empty delay slots should be moved to avoid a
performance degradation.
Figure 9.5: Splitting example for a processor with two issue slots (left: if (x==0) { A; B; C; } else { D; E; F(); }; right: bool tmp = (x==0); if (tmp) { A; } else { D; E; } if (tmp) { B; C; } else { F(); }; legend: then, else, unused, jump, delay slot)
This idea is implemented by the splitting mechanism. The algorithm processes only non-predicated
ITE statements whose then and else blocks have a single incoming control flow edge. This
restriction avoids a complicated performance analysis, because otherwise compensation code would
have to be taken into account as well. Afterwards, assembly instructions are moved from the ITE
blocks as illustrated in Figure 9.5. The algorithm alternately selects instructions from the then
and else block (i.e. A, D, and E in the example) and moves them into the delay slots of the
conditional jump, where they are predicated. An instruction is considered movable if it can be
predicated and does not change the control flow. Furthermore, it must not write a predicate which
is used as condition of the jump or as guard of an ITE block (in case of partial if-conversion).
Moreover, to simplify the dependency analysis, it must not depend on a non-movable instruction.
If a non-movable instruction is found in one block, the algorithm proceeds with instructions from
the other block. It stops either if no more movable instructions are found or if a configurable
threshold (3 in the example) is reached. Note that the pseudo code list is reordered in advance:
after a non-movable node there could be other movable nodes in the pseudo code list that have
no dependencies on the non-movable node. Thus, for each node that comes after a non-movable
node it is checked whether it depends on the non-movable node. In that case, it is marked as
non-movable. Otherwise it is moved before the non-movable node.
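The selection of movable instructions can be sketched as follows. This Python illustration is a minimal model under stated assumptions: the Instr record, its fields and both function names are hypothetical, dependencies are given explicitly by name instead of being derived from dataflow information, and the example marks B and C as non-predicable and F as a call purely to reproduce the selection A, D, E mentioned above.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    predicable: bool = True
    changes_control_flow: bool = False
    writes_guard: bool = False      # writes the jump condition or a block guard
    deps: Tuple[str, ...] = ()      # names of instructions in the same block it depends on


def movable_instructions(block: Sequence[Instr]) -> List[Instr]:
    """Reordering step: collect the movable nodes in their original order."""
    non_movable = set()
    movable = []
    for ins in block:
        ok = (ins.predicable
              and not ins.changes_control_flow
              and not ins.writes_guard
              and not any(d in non_movable for d in ins.deps))
        if ok:
            movable.append(ins)
        else:
            non_movable.add(ins.name)   # dependents become non-movable as well
    return movable


def select_for_delay_slots(then_block, else_block, threshold):
    """Alternately pick movable instructions from both blocks up to threshold."""
    queues = [movable_instructions(then_block), movable_instructions(else_block)]
    picked = []
    turn = 0
    while len(picked) < threshold and any(queues):
        if queues[turn]:                 # an exhausted block is simply skipped
            picked.append(queues[turn].pop(0).name)
        turn = 1 - turn
    return picked


then_block = [Instr("A"), Instr("B", predicable=False), Instr("C", predicable=False)]
else_block = [Instr("D"), Instr("E"), Instr("F", changes_control_flow=True)]
moved = select_for_delay_slots(then_block, else_block, threshold=3)
```

Collecting the movable nodes first corresponds to the reordering described above: a node behind a non-movable node stays eligible unless it depends on that node.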
9.4 Retargeting Formalism
An evaluation [80] of several processors for different application domains showed that processors
featuring PE can be grouped according to the location in which the guard is stored. Chiefly, the
following three categories can be identified:
1. Processors using general purpose registers,
2. architectures using dedicated registers, and
3. those that use condition flags stored in a status register.
The first retargeting step is to configure the cost computation. Three boolean parameters for
the PE engine specify to which of the above classes the target architecture belongs. Another
boolean parameter indicates whether the architecture directly supports negated conditions or
not. Furthermore, the jump penalty J for a conditional jump taken, a conditional jump not taken
and an unconditional jump needs to be provided.
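The parameters listed above could be grouped as in the following Python sketch. It is not the configuration format of the PE engine: the class and field names are invented for illustration, the three boolean class parameters are modelled as one enumeration for brevity, and the penalty values in the example are placeholders rather than measured data.

```python
from dataclasses import dataclass
from enum import Enum


class GuardLocation(Enum):
    GENERAL_PURPOSE_REGISTER = "gpr"
    DEDICATED_REGISTER = "dedicated"
    STATUS_FLAGS = "flags"


@dataclass(frozen=True)
class PeConfig:
    guard_location: GuardLocation
    supports_negated_conditions: bool
    jump_taken: int          # penalty of a taken conditional jump, in cycles
    jump_not_taken: int      # penalty of a conditional jump that is not taken
    jump_always: int         # penalty of an unconditional jump

    def iteins_setup_cost(self) -> int:
        """Setup cost S_2: one extra instruction unless negation is built in."""
        return 0 if self.supports_negated_conditions else 1


# A configuration for a target that keeps guards in general purpose registers
# and has to compute negated predicates explicitly (penalties are placeholders).
example_target = PeConfig(GuardLocation.GENERAL_PURPOSE_REGISTER, False, 3, 0, 3)
```

An enumeration makes the three classes mutually exclusive by construction, which three independent boolean flags would not guarantee.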
Moreover, some architectures can execute a wide subset of their instruction set conditionally,
whereas others offer a predicated version for only a few instructions. In order to determine
whether an instruction or a basic block can be conditionally executed by the target processor, the
generated tree-covering-based code selector is employed. As mentioned in Section 3.3.2, each rule
describes how a certain IR operation is mapped to the target assembly code. For retargeting the PE
optimization, each rule of the code selector that emits conditionally executable code has to be
annotated. Listing 9.13 shows two examples for the TriMedia [169] processor. The
rule covering a plus node can be conditionally executed (denoted by peinclude). The other rule
is missing that annotation and thus is assumed to be not conditionally executable by default.
Consequently, if one of the rules covering the then or else block is missing that annotation,
if-conversion cannot be applied to the corresponding if-statement. Furthermore, the instructions
of such a rule cannot be moved by the splitting mechanism.
RULE o:mirPlus(s1:reg_nt, s2:reg_nt) -> d:reg_nt;
CLASS peinclude;
EMIT {
    print_with_condition("\tiadd %s %s -> %s",
        REGNAME(s1), REGNAME(s2), REGNAME(d));
}
RULE o:mirIntConst -> d:reg_nt;
EMIT {
    print("\tuimm( %s ) -> %s ", o.Value, REGNAME(d));
}
Listing 9.13: Annotated TriMedia code selector rules
// Register r0 is always zero and r1 always one
INSTRUCTION peSetCondition (cond:reg_nt) -> d:reg_nt;
EMIT {
    print("IF %s iadd r1 r0 -> %s",
        REGNAME(cond), REGNAME(d));
}
INSTRUCTION peResetCondition (cond:reg_nt) -> d:reg_nt;
EMIT {
    print("IF %s iadd r0 r0 -> %s ",
        REGNAME(cond), REGNAME(d));
}
INSTRUCTION peNegateCondition (s:reg_nt) -> d:reg_nt;
EMIT {
    print("IF r1 bitinv %s -> %s",
        REGNAME(s), REGNAME(d));
}
INSTRUCTION peBranchAlways (label:BasicBlock);
EMIT {
    print("IF r1 ijmpi ( %s )", label);
}
INSTRUCTION peBranchCond (cond:reg_nt, label:BasicBlock);
EMIT {
    print("IF %s ijmpi ( %s )", REGNAME(cond), label);
}
Listing 9.14: PE instruction rules for the TriMedia
For the code generation, the code emitter must take care to print the correct assembly syntax (see
Listing 9.13) in case the rule is used in a predicated block. For instance in case of the TriMedia,
the print function must prepend an IF <condition register> to the given instruction in case
the instruction is executed conditionally.
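The behaviour expected from such an emitter function can be sketched in Python. The function name mirrors the emitter call used in Listing 9.13, but the signature, the guard parameter and the handling of the leading tab are assumptions for this illustration and do not describe the CoSy library function.

```python
from typing import Optional


def print_with_condition(template: str, *args: str,
                         guard: Optional[str] = None) -> str:
    """Format one assembly line; prefix it with IF and the guard register
    when the instruction is emitted inside a predicated block."""
    text = template % args
    if guard is None:
        return text                       # unconditional emission
    body = text.lstrip("\t")
    indent = text[: len(text) - len(body)]
    return f"{indent}IF {guard} {body}"   # TriMedia-style guard prefix
```

For example, calling it with the template of the plus rule, the registers r2, r3, r4 and guard="r5" yields a tab followed by "IF r5 iadd r2 r3 -> r4", whereas without a guard the plain iadd line is returned.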
The rules covering an if-statement are responsible for generating the code for the selected ITE
scheme. Note that the generated code does not only depend on the scheme but also on the
order of the then and else block. As mentioned above, some cases can only be handled with
dedicated implementation schemes, whereas for others it is sufficient to adapt the code generation.
Nevertheless, all implementation schemes can be generated with the following few instructions:
peSetCondition      conditionally sets a predicate to true
peResetCondition    conditionally sets a predicate to false
peNegateCondition   conditionally inverts a condition
peBranchAlways      unconditional jump instruction
peBranchCond        conditional jump instruction
Retargeting the code generation is limited to filling in rule templates for these instructions with
the assembly code that has to be emitted. Listing 9.14 shows the filled templates for the TriMedia
processor. Additionally, each if-statement rule must call a generic function instead of printing
anything. No other information, apart from that already described, needs to be provided to retarget
the extension. This can also be performed via the Compiler Designer GUI. In this way, the PE
optimization can be quickly retargeted to varying processor configurations during architecture
exploration.
9.5 Code Generation Flow
Due to the modular concept of CoSy it is straightforward to intertwine the standard backend
components (tree pattern matcher, scheduler, register allocator) with the PE modules. Figure 9.6
depicts the backend of a CoSy compiler with PE support.
After an initial code selection with the standard tree pattern matcher the engine PEpreproc builds
ITE trees and determines those if-statements to which if-conversion can be applied. Reasons for
an exclusion can be multiple incoming control flow edges of the then or else block as well as
non predicable code in an ITE block. The latter is detected utilizing the already described rule
annotations. If a basic block is covered by a rule emitting non conditionally executable code, an
infinite cost value is assigned to the PE schemes of the corresponding if-statement. Then the
costs of the different schemes are calculated and the scheme selection is performed by the engine
PEcosts (Section 9.3.3). This engine is coupled to the normal scheduler of CoSy. In the first
iteration, the scheduler calculates the execution times of each basic block. These are used to
compute the costs for the implementation with jump instructions. Afterwards, PEcosts instructs
the scheduler to merge the then and else blocks of the innermost statements. The scheduler
parallelizes them and provides cost estimates of the block merger. Thereafter, PEcosts selects the
schemes according to the calculated costs. After the final code selection and register allocation the
engine PEcode generates the code for the chosen schemes using the above mentioned instructions.
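The cost-driven selection described above can be illustrated with a minimal sketch. It is not the PEcosts implementation: the actual cost functions are defined in Section 9.3.3 and are not reproduced here. The names `IfStatement`, `jump_cost`, `pe_cost`, `select_scheme` and the field `branch_penalty` are hypothetical, and the worst-case formula for the jump-based variant is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class IfStatement:
    t_then: int          # scheduled cycles of the then block, T(BT)
    t_else: int          # scheduled cycles of the else block, T(BE)
    t_merged: int        # scheduled cycles of the merged, predicated block
    predicable: bool     # False if a covering rule emits non-predicable code
    branch_penalty: int  # hypothetical cycles lost to jumps and delay slots


def jump_cost(stmt):
    # Assumed worst-case estimate for the implementation with jumps.
    return max(stmt.t_then, stmt.t_else) + stmt.branch_penalty


def pe_cost(stmt):
    # Non-predicable statements get an infinite cost, as described in the text.
    return stmt.t_merged if stmt.predicable else math.inf


def select_scheme(stmt):
    # Pick the predicated scheme only if it is strictly cheaper.
    return "predicated" if pe_cost(stmt) < jump_cost(stmt) else "jumps"
```

With these assumptions, a statement whose merged block needs 4 cycles is predicated when the jump variant costs 6 cycles, while any statement marked as non-predicable always keeps its jumps.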
Figure 9.6: CoSy compiler backend with PE support
The splitting mechanism operates within the scheduler and targets all if-statements to which
if-conversion could not be applied. Apart from the compiler’s dataflow information, it uses the
tree pattern matcher’s annotations that indicate whether an instruction is predicable. Finally, the
code is emitted.
This approach requires little retargeting information, partly because it is coupled to existing
compiler backend modules. These are typically part of any retargetable compiler. Thus, the approach
is not limited to the CoSy platform and can easily be incorporated into other compiler
platforms as well.
9.6 Experimental Results
The presented technique was successfully integrated into CoSy compilers for the
Adelante™ VD32040 Embedded Vector Processor (EVP) [134] and the TriMedia multimedia
processor, both from NXP Semiconductors [169], as well as the ARM9 [33]. The required retargeting
effort for PE support was one day for each compiler. All three architectures can execute almost
all their instructions conditionally. The TriMedia can use any of its 128 general purpose registers
to store the predicate, whereas the EVP features eight dedicated predicate registers. The negated
predicate has to be computed explicitly for both processors. The ARM uses condition code flags
for predication. It can store one condition at a time in the status register and supports negation.
Thus, each processor belongs to one of the groups mentioned in Section 9.4. The maximum
VLIW parallelism available in the EVP equals five vector operations, four scalar operations, three
address operations and loop control. The TriMedia can process up to five operations in parallel. The EVP jumps
have 5-7 delay slots while the TriMedia jumps have two. In contrast, the ARM is a RISC-like
core. Since the ARM has no delay slots, the splitting mechanism was disabled. The only benefit
of PE for the ARM lies in the elimination of jump instructions.
          ARM                                   EVP                                   TriMedia
benchmark if-stmts recognized converted split   if-stmts recognized converted split   if-stmts recognized converted split
adpcm           24         16        16     -         18         16        16     0         26         22        12    16
viterbi         43         40        38     -          2          0         0     0          6          2         2     3
median          53         51        51     -         13         13        13     0         14         13        13     0
wave            62         59        58     -          3          3         3     0          7          3         3     3
idct            65         64        63     -         16         16        15     1         20         16         8    11
cjpeg         1360        143        88     -       1994        307       202  1555       2870        442       142  1636
djpeg         1118        118        89     -       1934        306       206  1554       2894        441       143  1662
printf         198         33        22     -         97         30        16    66        118         47        43    67
miniLzo         63          2         1     -         54          3         2    42        142          9         3    88
Table 9.2: If-Statement statistics for all benchmarks
The benchmarks consist of some smaller, typical signal processing kernels (up to 70 ITE statements)
as well as some larger and more complex applications (up to 2000 ITE statements). The
total number of if-statements varies between the compilers due to their different design and
integrated optimizations. Table 9.2 shows detailed statistics for the total number of ITE statements,
those that are recognized by PEpreproc and how many have finally been converted and split,
respectively. If not stated otherwise, the test data that comes with these benchmarks is used for
the measurements and it is optimized for the worst-case execution time.
Figure 9.7: Speedup factors for both benchmark groups. (a) Speedup for small benchmarks (adpcm, viterbi, median, wave, idct); (b) speedup for large benchmarks (cjpeg, djpeg, printf, miniLzo). Vertical axis: speedup factor; series: ARM, EVP, TriMedia.
For the small benchmarks, PEpreproc determines that on average 80% of all if-statements can be
considered for PE, the only exception being the viterbi [110] for the EVP with no predicable
if-statements. Almost all these if-statements could finally be converted for the EVP, whereas the
TriMedia could not convert all of them. This is mainly due to the higher degree of parallelism
the EVP offers over the TriMedia. Thus, resource conflicts are more likely on the TriMedia,
resulting in longer schedules and hence higher costs for predicated if-statements. Consequently,
more if-statements are split for the TriMedia than for the EVP. Figure 9.7(a) shows high speedups
for the VLIW processors, whereas the ARM shows smaller speedups.
The programs cjpeg and djpeg [49, 135] feature a large number of if-statements (around 2000);
however, only approximately 15% of them were recognized by PEpreproc for if-conversion. In the end,
only 6-10% of all if-statements could be converted by the compilers. Here, the splitting mechanism
proves advantageous and handles nearly 80% (EVP) and 60% (TriMedia) of all if-statements.
The ARM shows only marginal speedups due to the disabled splitting mechanism, but EVP and
TriMedia show good speedups for both cjpeg and djpeg (Figure 9.7(b)). The obtained speedups
are less significant than for the small kernels. This is understandable considering that the cycles
spent in the runtime library for file operations dominate the execution time.
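The percentages quoted for cjpeg and djpeg can be recomputed from Table 9.2. The following sketch does so for the two VLIW targets; the counts are copied from the table, while the dictionary layout and the helper name `shares` are only illustrative.

```python
# (total if-statements, recognized, converted, split), copied from Table 9.2
TABLE_9_2 = {
    ("cjpeg", "EVP"): (1994, 307, 202, 1555),
    ("cjpeg", "TriMedia"): (2870, 442, 142, 1636),
    ("djpeg", "EVP"): (1934, 306, 206, 1554),
    ("djpeg", "TriMedia"): (2894, 441, 143, 1662),
}


def shares(total, recognized, converted, split):
    """Return recognized, converted and split counts as percentages of total."""
    return tuple(round(100.0 * x / total, 1) for x in (recognized, converted, split))


for key, row in TABLE_9_2.items():
    print(key, shares(*row))
```

For both programs roughly 15% of the if-statements are recognized on either target, and the split share comes out at about 78% to 80% for the EVP and about 57% for the TriMedia, in line with the rounded figures given in the text.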
The printf application (its implementation is shipped with CoSy) contains many if-statements
(around 100); approximately 17% are converted and around 60% are split by the EVP
and TriMedia compilers. No results are reported for the ARM, since it could not be compiled due
to a different runtime library setup.
For miniLzo [173], although it features many if-statements (around 80), only a few could be
converted. A look into the source code revealed that the if-statements contain either function calls
or goto statements. These kinds of if-statements are not allowed by PEpreproc and thus, no
performance improvement can be obtained. However, except for the ARM, the splitting mechanism
can be applied again and optimizes almost all if-statements.
On average, speedups of 1.2 for the ARM9, 1.5 for the EVP and 1.47 for the TriMedia can be
obtained.
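As a quick cross-check, the arithmetic mean of the three average speedups can be computed directly. The sketch assumes that the average improvement of 39% quoted in the summary chapter is this mean expressed as a percentage; that reading is an interpretation, not something the text states.

```python
# Average speedups per target, taken from the text above.
speedups = {"ARM9": 1.2, "EVP": 1.5, "TriMedia": 1.47}

mean = sum(speedups.values()) / len(speedups)
improvement_pct = (mean - 1.0) * 100.0

print(f"mean speedup {mean:.2f}, about {improvement_pct:.0f}% over the baseline")
```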
For the code size, PE typically saves some instructions (jumps and nops), but may also generate
new ones (e.g. negated conditions). In general, code size is slightly reduced (see Figure 9.8).
The optimization algorithm itself has linear complexity (O(n) worst case complexity for n ITE
statements). Furthermore, it requires one additional tree pattern matcher pass and two additional
scheduler passes. For n IR nodes in the ITE statements the worst case complexity of tree pattern
matching is O(n) whereas for scheduling it is O(n^2). Thus, the total worst case complexity is
quadratic.
Figure 9.8: Code size results for all benchmarks (code size factor per benchmark; series: ARM, EVP, TriMedia)
9.7 Conclusions
In contrast to previous, largely target specific, code optimizations for predicated execution, this
thesis implements a retargetable approach in order to enable PE for a wide range of processor
architectures at limited manual effort. This is achieved by a retargetable predicated execution
extension for the CoSy compiler development system. This concept has been proven by generating
PE enabled compilers for embedded processors with different PE configurations. Generally, good
speedups and slight code size reductions are achieved for all processors. The required retargeting
information is quite limited and its specification fits nicely into the Compiler Designer concept
(cf. Section 5.2). Thus, the integration enables a complete and retargetable path from a single
processor model, written in the LISA ADL, to a C compiler with PE optimization.
Further improvements in code quality seem possible. For instance, conditions of if-statements are
often composed of expressions combined with boolean operations, which are mapped onto several
nested ITE statements. If the evaluation of the individual expressions is free of side effects, they
can be evaluated in parallel. This idea could be implemented by a new scheme for the PE engines.
Furthermore, a mechanism to enforce PE for certain ITE statements might be useful to enable
other optimizations, e.g. software pipelining, which are blocked by control flow.
Chapter 10
Assembler Optimizer
Some optimizations can only be performed on the assembly level of the application. This chapter
presents a retargetable low-level, assembly code optimization interface that is generated from a
LISA description. Figure 10.1 illustrates the corresponding code generation flow.
Figure 10.1: Assembler Optimizer code generation flow
Using the LISA ADL, a stand-alone assembler is automatically generated that is able to perform
user-defined transformations or optimizations at the assembly level. In this way, standard
assembly-level optimizations only need to be implemented once and are automatically retargeted to a
given LISA model. The assembler optimizer provides a user-accessible, convenient Application
Programmer Interface (API) for accessing the assembler’s internal data structures. The API
supports control and data flow analysis, a prerequisite for most optimization techniques. This
enables the ASIP designer to implement optimizations addressing special ISA features such as
• Peephole optimization
• Address code optimization
• Register (re-)allocation
• Coupling of register allocation and scheduling
• Bit-level manipulation instructions
• ...
The remainder of this chapter is arranged as follows. After the discussion of related work in
Section 10.1, Section 10.2 briefly describes the functions provided by the API. Sections 10.3 and
10.4 present a scheduler and a peephole optimizer which have been built as demonstrators. Finally,
Section 10.5 provides some results.
10.1 Related Work
The PROPAN system [58] is a retargetable framework for code optimizations and machine-dependent
program analyses at the assembly level. Its main focus is post-pass optimization in order to
reuse existing software tool chains. It needs a separate target specification called TDL to retarget
the optimization modules. Several optimization modules based on integer linear programming
have been implemented and retargeted to different real world DSPs. It can also be used as a
platform for generic program analysis, e.g. to calculate worst case execution times.
A similar approach is the SALTO system [66]. Based on an ADL description of the target machine
it generates the functionality to build profiling, tracing or optimization tools. It is intended to
be part of a global solution for manipulating assembly-code, i.e. to implement low-level code
modifications as well as to provide a high-level code re-structurer with useful information collected
from assembly code and instruction profiling. However, it is more oriented towards general purpose
processors and many architecture-specific properties of ASIPs cannot be modeled at all.
The LANCE compiler platform [198] supports the generation of a low level, assembly like repre-
sentation, called LLIR. Standard assembly-level optimizations only need to be implemented once
and are automatically available when LANCE has been retargeted to a new target architecture.
For instance, in [125] a bit-true dataflow analysis is performed on the LLIR which has been suc-
cessfully used to implement dedicated optimization for network processors supporting bit packing
instructions.
10.2 Application Programmer Interface
Nearly all optimization modules that can be built on top of the API require access to architectural
information. For instance, it might be necessary to recover the semantics of the instruction
currently parsed by the assembler. As an example, the API supports an easy way to extract which
register is used as destination and which as source operand, or whether the instruction is a control
flow instruction. Basically, this information is either directly extracted from the BEHAVIOR section or
from the SEMANTICS section, if available. However, only the latter gives precise information due
to the semantic gap mentioned in Section 6.1. Furthermore, this information is used to perform a
control and data flow analysis (cf. Section 3.3.1). However, reconstructing the CFG from compiled
and possibly scheduled code is not an easy task. In particular, two features can complicate CFG
construction. Firstly, a destination address stored in a register (instead of a label or immediate
constant) introduces a level of uncertainty which may lead to spurious edges in the graph. Secondly,
in case of scheduled code, jump delay slots complicate the process of finding the first and
last instruction of a basic block. This is even more difficult once the delay slots contain further
branches. The implemented algorithm, described in [127], can handle such problems. Nevertheless,
the information whether the input code is scheduled or not must be passed as an option to the
API before the analysis can be started. Afterwards, the functions to access and iterate over the
control- and data-flow graphs can be used.
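The effect of jump delay slots on basic block boundaries can be illustrated with a small sketch. It is not the algorithm of [127]: it assumes that branch targets are always labels and that delay slots never contain further branches, which are exactly the two hard cases named above. The type `Insn` and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Insn:
    text: str
    label: Optional[str] = None   # label defined at this instruction, if any
    target: Optional[str] = None  # branch target label, None for non-branches


def find_leaders(code, delay_slots):
    """Return the sorted indices of instructions that start a basic block."""
    label_index = {insn.label: n for n, insn in enumerate(code) if insn.label}
    leaders = {0} if code else set()
    for n, insn in enumerate(code):
        if insn.target is None:
            continue
        # A branch target always starts a new block.
        if insn.target in label_index:
            leaders.add(label_index[insn.target])
        # The delay slots still belong to the block of the branch, so the
        # next block starts only after them.
        after = n + delay_slots + 1
        if after < len(code):
            leaders.add(after)
    return sorted(leaders)


def basic_blocks(code, delay_slots):
    """Split the instruction list into basic blocks at the leaders."""
    bounds = find_leaders(code, delay_slots) + [len(code)]
    return [code[a:b] for a, b in zip(bounds, bounds[1:])]
```

For the same instruction sequence, the block containing a jump ends directly after the jump when `delay_slots` is 0 and two instructions later when it is 2, which is why the scheduled or unscheduled state of the input has to be known before the analysis starts.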
Furthermore, it is possible to modify instructions and sequences thereof. Since all architectural
information is available, each instruction element (register operand, VLIW slot etc.) that
corresponds to a LISA operation can be modified.
Finally, the API provides basic assembler-related functions such as file I/O functions to read and
write assembly or object files. Using the API, the implementation of an assembler (without any
optimizations) is straightforward, basically just a main function containing a few function calls.
10.3 Scheduler
The current scheduler generated by the Compiler Designer tool [172] has several limitations.
All architectural information (instruction properties such as latencies, resource usage etc.) is
transferred to the scheduler via annotations in the compiler generated assembly code, as
Listing 10.1 illustrates. Each instruction is encoded in three so-called .packs assembler directives.
Consequently, the scheduler cannot schedule handwritten assembly code, which comes without these
annotations.
.packs "alu rrr;P1;C1;T2;",1
.packs "PC:(r,0);prog:(r,0);R15:(r,0);R6:(r,0);R1:(w,0);",2
.packs "add R1 , R15 , R6 ;; Add two register ",3
.packs "ld rr;P3;C3;T10;",1
.packs "PC:(r,0);data:(r,0);prg:(r,0);R1:(r,0);R2:(w,0);",2
.packs "lb R2 , R1 , 0 ;; Load signed byte",3
Listing 10.1: Annotated assembly code
add R1, R15, R6
lb R2, R1, 0
Listing 10.2: Normal assembly code
Of course, the user could add them manually, but since the syntax is quite complicated this is
time consuming and error prone. This information is now also available through the API. Hence,
a new scheduler, based on the existing implementation, has been created which no longer needs
these annotations (Listing 10.2). This allows stand-alone, user-friendly assembly-level
scheduling that is independent of a compiler on top of the flow.
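To give an impression of what the resource annotations in Listing 10.1 encode, the following sketch parses the second .packs string of each instruction into read and write sets and derives a read-after-write dependency. The format is not documented in this section, so reading `(r,n)` as a read access and `(w,n)` as a write access, and ignoring the numeric field, are assumptions; the function names are hypothetical.

```python
import re

# Matches entries such as "R1:(w,0)"; the numeric field is ignored here.
ACCESS = re.compile(r"(\w+):\((r|w),\d+\)")


def parse_resources(pack):
    """Split a resource annotation string into read and write sets."""
    reads, writes = set(), set()
    for name, mode in ACCESS.findall(pack):
        (reads if mode == "r" else writes).add(name)
    return reads, writes


def raw_dependency(first_pack, second_pack):
    """Resources written by the first instruction and read by the second."""
    _, first_writes = parse_resources(first_pack)
    second_reads, _ = parse_resources(second_pack)
    return first_writes & second_reads


# Resource strings copied verbatim from Listing 10.1.
add_pack = "PC:(r,0);prog:(r,0);R15:(r,0);R6:(r,0);R1:(w,0);"
lb_pack = "PC:(r,0);data:(r,0);prg:(r,0);R1:(r,0);R2:(w,0);"
```

Under these assumptions, `raw_dependency(add_pack, lb_pack)` yields R1, the register written by the add and read by the load. A scheduler without annotations needs the same information, which the API-based scheduler obtains from the LISA model instead.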
10.4 Peephole Optimizer
As a second demonstrator, a peephole optimizer [113, 41] called lpeep has been implemented. It is a
classical optimization which runs after the compiler. Basically, it tries to improve the performance
of the target program by searching for a short sequence of target instructions and replacing it with
a better sequence. It can be easily implemented using the API functions to read and write assembly
files as well as those to insert or delete assembly lines or instructions in VLIW slots. No
scheduler or data and control flow functions are needed. The peephole optimizer is driven by
a user-defined replacement library. However, the peephole optimizations are not automatically
generated as described e.g. in [114]. Implementation-wise, the library puts an abstraction layer
on top of the API in order to reuse large parts of the optimizer for different target architectures.
Thus, a peephole optimizer can actually be generated for any LISA model. The library is then
used to retarget the optimizer to the given target. The input of lpeep is either the assembly code
produced by a compiler or a hand-crafted assembly program.
10.4.1 Replacement Library
The replacement library describes the assembly patterns and their related replacement. Each
entry, called replacement rule, consists of three parts: the variable definitions, the original section,
and the replacement section. Figure 10.2 gives an example. Generally, variables are registers or
immediate values which can be used in assembly instructions. The original section is used to find
matching lines in the source file, which are then replaced by the pattern defined in the replacement
section. Inside the patterns the assembly syntax of the target architecture is used. Since the API
provides all architectural information, quite detailed assembly patterns can be specified. This is
described in the next sections.
TRANSFORM ( <variable list> ) {
<original section>
} TO {
<replacement section>
}
Figure 10.2: Replacement rule
Variable Definitions
The different types which can constitute a variable are described in the following.
REGISTER: A register variable can either match all registers of the target architecture or only
a user-defined subset:
REGISTER <variable name> [ = (<reg1>, <reg2>, ...) ]
A simple example is given in Listing 10.3 in which variable a can only match the registers
in the given set (as defined in the LISA model). Internally, lpeep makes the assumption that
each register variable relates to a different register, i.e. a, b and c must match different
registers.
TRANSFORM (REGISTER a=(R1,R2,R3,R4),
REGISTER b, REGISTER c) {
a = b;
a = c;
} TO {
a = c;
}
Listing 10.3: Register variable example
TRANSFORM (REGISTER a, OPERAND b,
OPERAND c, BLOCK d) {
a = b;
BLOCK d (DONT READ a);
a = c;
} TO {
BLOCK d;
a = c;
}
Listing 10.4: Block variable example
IMMEDIATE: A variable of this type will match immediate values occurring in the source file.
This can be simple numerical values, symbolic labels or arithmetic expressions. Furthermore,
the user can specify conditions for the value.
IMMEDIATE <variable name> [ [==, !=, <, >] <value> ]
OPERAND: These variables will match both registers and immediates. They are introduced for
convenience to match instructions with similar assembly syntax for immediates and
registers. However, conditions are not available in the definition of the operand variables.
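How register variables bind during matching can be sketched as follows. This is not lpeep's implementation: it works on whitespace-separated tokens in the `dst = src` style of Listing 10.3, does not check that a bound token is really a register, and the function names are hypothetical. It does enforce the two properties described above, namely the optional register subset and the rule that different variables bind to different registers.

```python
def match_rule(pattern, lines, variables, allowed=None):
    """Match pattern lines against assembly lines, token by token.

    Returns a dict {variable: register} on success, otherwise None.
    """
    allowed = allowed or {}
    if len(pattern) != len(lines):
        return None
    binding = {}
    for pat, line in zip(pattern, lines):
        pat_tokens, line_tokens = pat.split(), line.split()
        if len(pat_tokens) != len(line_tokens):
            return None
        for p, t in zip(pat_tokens, line_tokens):
            if p not in variables:
                if p != t:           # literal token must match exactly
                    return None
            elif p in binding:
                if binding[p] != t:  # variable already bound to another register
                    return None
            elif t in binding.values():
                return None          # distinct variables need distinct registers
            elif p in allowed and t not in allowed[p]:
                return None          # register outside the user-defined subset
            else:
                binding[p] = t
    return binding


def apply_rule(replacement, binding):
    """Instantiate the replacement section with the bound registers."""
    return [" ".join(binding.get(tok, tok) for tok in line.split())
            for line in replacement]
```

Applied to the rule of Listing 10.3, the lines `R1 = R5` and `R1 = R6` bind a, b and c to R1, R5 and R6 and are replaced by `R1 = R6`. The sequence `R1 = R5`, `R1 = R1` is rejected because c would have to share a register with a; in that case the first assignment feeds the second one and must not be removed.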
The variables discussed so far can be used to replace single lines or fixed-length sequences of lines.
lpeep also offers features to define rules that can also change the control flow of the assembly code.
This includes variables to match labels or variable-length sequences of lines. Since the detailed
behavior described in LISA is also available through the API, it is possible to specify conditions
for the resource usage of the instructions matched by the wildcard. Such a feature is not available
in traditional peephole optimizers.
BLOCK: The block variable is the most complex variable type provided by lpeep. It is used as
wildcard in the original section to match one or more assembly instructions. The user can
control the block match criteria by adding a list of constraints to the block variable (Listing
10.4). Valid constraints are:
1. DONT READ <register variable> || (<reg1>, <reg2>, ...)
This constraint will exclude instructions containing read accesses of the specified reg-
ister variable or physical registers from the match.
2. DONT WRITE <register variable> || (<reg1>, <reg2>, ...)
Same as previous except for write access.
3. DONT ACCESS <register variable> || (<reg1>, <reg2>, ...)
The combination of the DONT READ and DONT WRITE constraints.
4. DONT MATCH ( <assembly statement> )
This constraint will exclude any lines that match the given pattern from the block
match.
5. MAX LINES <number of lines>
This constraint will limit the number of matched instructions.
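The interplay of a BLOCK wildcard with the DONT READ and MAX LINES constraints can be sketched as below. The read and write sets are supplied by hand here, whereas lpeep derives them from the LISA model through the API. The greedy scan, the tuple layout and the name `match_block` are assumptions made for this illustration only.

```python
def match_block(instructions, start, stop_pred, dont_read=(), max_lines=None):
    """Scan a wildcard block beginning at index `start`.

    Each instruction is a (text, reads, writes) triple. The block must hold
    at least one instruction. Returns the index of the first instruction
    after the block for which stop_pred is true, or None if a constraint is
    violated or no such instruction exists.
    """
    forbidden = set(dont_read)
    n = start
    while n < len(instructions):
        insn = instructions[n]
        if n > start and stop_pred(insn):
            return n              # block ends just before this instruction
        if max_lines is not None and n - start >= max_lines:
            return None           # MAX LINES exceeded
        if insn[1] & forbidden:
            return None           # DONT READ violated
        n += 1
    return None
```

For the rule in Listing 10.4, the scan would start right after `a = b` with `dont_read` set to the register bound to a and a stop predicate that recognizes the second write to a. If the block is matched, the first assignment is dead and can be dropped.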
LABEL and NEWLABEL: These variable types match labels. The NEWLABEL variables
can only be used in the replacement section of a rule to create a new label with a unique
name. An example is provided in Listing 10.5.
TRANSFORM (LABEL l1, LABEL l2,
BLOCK b1, BLOCK b2) {
jmp l1;
BLOCK b1;
LABEL l1;
jmp l2;
BLOCK b2;
LABEL l2;
} TO {
jmp l2;
BLOCK b1;
LABEL l1;
BLOCK b2;
LABEL l2;
}
Listing 10.5: Label variable example
TRANSFORM (REGISTER a,
REGISTER b,
REGISTER e)
{
a = b; || EXTRA SLOTS c (DONT ACCESS a);
a = e; || EXTRA SLOTS d (DONT ACCESS a);
} TO {
a = e; || EXTRA SLOTS c || EXTRA SLOTS d;
}
Listing 10.6: VLIW pattern example
Matching VLIW Instructions
To define replacement patterns for the optimization of VLIW assembly code, lpeep provides the
|| operator to separate the different slots of a VLIW instruction. Listing 10.6 illustrates the use
of the || operator. The EXTRA SLOTS keyword is supported by lpeep to be used as a wildcard
in a VLIW instruction word. Similar to the BLOCK variables, EXTRA SLOTS can also take
constraints to restrict the matched instructions. All constraint definitions available for the
BLOCK variables are supported in the definition of the EXTRA SLOTS as well.
10.5 Experimental Results
The API and the presented modules are fully integrated into CoWare’s Processor Designer
environment and thus can be generated for any LISA model. For evaluation, the compiler with the
generated code selector description for the ST220 as presented in Chapter 6 has been used. As
this configuration contains only automatically generated rules, there is some
room for improvement which can be exploited by the peephole optimizer. Additionally, it is a
good candidate to show the applicability of the peephole optimizer as it features VLIW slots and
different constraints on registers as well as LISA resources.
The replacement library for the ST220 contains 26 patterns in total. Note, its main purpose
was to cover all features of the peephole optimizer and thus, it cannot be considered an optimal
replacement library. The API based scheduler contains some minor improvements as compared
to the existing scheduler. Basically, implicit register accesses can be directly detected by the
dependency analysis. Such dependencies must be explicitly modeled in the existing scheduler
which results typically in a conservative scheduler description.
Figure 10.3: Relative cycle count (in %) for adpcm, dct, fht, fir, sieve and viterbi; series: old scheduler, new scheduler, new scheduler + peephole optimizer
In case of the ST220, this limitation prevented delay slot filling in certain cases. This caused
a considerable number of NOP instructions at the end of basic blocks. With the improved dependency
analysis this drawback could be eliminated. The improvements in cycle count (Figure 10.3) gained by the
API based scheduler range from 0% to 7%. Consequently, as fewer NOPs are required, code size
(Figure 10.4) is decreased by up to 11%. The improvements in cycle count achieved by the peephole
optimizer range between 1% and 16%, while code size can be reduced by 5% to 19%.
Figure 10.4: Relative code size (in %) for adpcm, dct, fht, fir, sieve and viterbi; series: old scheduler, new scheduler, new scheduler + peephole optimizer
10.6 Conclusions
The integration of a retargetable assembler optimizer API into an ADL based design environment
enables a convenient way to implement assembly level optimizations. Retargetable optimizations
based on the API can be easily added to the environment, or ASIP designers can implement
their own hand-crafted optimizations. The interface provides all information (e.g. data- and
control-flow information) which is typically required for such optimizations. Most importantly, all
architectural information such as processor resources, instruction semantics, etc. is still available
through the interface. In this way, optimizations for irregular architecture features can be quickly
implemented. To demonstrate the applicability of the API, a scheduler and a peephole optimizer
have been implemented. Since both tools are retargetable, they are already integrated into the
software tool generation flow of the Processor Designer and thus can be generated for any LISA
model. In the future, more retargetable assembly level optimizations could be added to this flow.
Chapter 11
Summary and Outlook
The complexity of today’s SoC designs is increasing at an exponential rate due to the combined
effects of advances in semiconductor technology as well as demands from increasingly complex
applications in embedded systems. Escalating NRE costs have created a shift towards achieving
greater design reuse with programmable SoC platforms. The choice of programmable architec-
tures strongly affects the success of a SoC design due to its impact on the overall cost, power
consumption, and performance. Therefore, an increasing number of embedded SoC designs
employ ASIPs as building blocks due to their balance between flexibility and high performance
by programmability and application specific optimizations. However, given today’s tight time-
to-market constraints, finding the optimal balance between competing design constraints makes
design automation inevitable.
Architecture description languages have been established as an efficient solution for ASIP archi-
tecture exploration. Among the main contributions of such languages is the automatic generation
of the software toolkit from a single ADL model of the processor. A key component of the
software toolkit is the C compiler which enables a compiler-in-the-loop design space exploration.
Developing an ADL, though, is a difficult task. Today’s ADLs must keep all architectural
information as required for the tool generation (in particular compiler and simulator) in an
unambiguous and consistent way. As a result, some ADLs are well-suited for e.g. the automatic
generation of the compiler, but impose major restrictions on, or are incapable of the generation
of a simulator. Other ADLs suffer from limited architectural scope and are not suitable for ASIP
design. An overview of existing ASIP design platforms and their capabilities is given in this
thesis. It turned out that none of the existing approaches solves this problem satisfactorily.
This thesis proposes a technique that enables the automatic retargeting of a C compiler, more
specifically the code selector description, from an ADL processor model using CoWare’s Processor
Designer and the CoSy environment. The presented approach incorporates a new, concise
formalism for the description of instruction semantics into the LISA language definition. Several
existing LISA models for representative embedded processors have been successfully enhanced
with the new section at moderate effort. This proves that the new section neither imposes any
particular modeling style nor limits LISA’s flexibility. The instruction’s semantics is used
by four different mapping rule generation methods which create the code selector description for
a C compiler fully automatically. The CoSy compilers with generated code selector description
show an overhead of 14% in cycle count and 48% in code size as compared to a compiler with
(non-optimized) hand-crafted code selector specification. These are acceptable values considering
that a compiler is available early in the architecture exploration phase. This is crucial to avoid
hardware/software mismatches right from the start in order to ensure good overall efficiency
of SoC platforms. Moreover, the entry barrier to compiler generation is further lowered. In
fact, even non compiler experts are now able to generate compilers for architecture exploration.
Additionally, the generated code selector rules are correct by construction which eliminates the
tedious debugging of code selector descriptions.
ASIP design platforms employ retargetable C compilers for compiler generation since they can
be quickly adapted to varying processor configurations. Unfortunately, such compilers are known
for their limited code quality as compared to hand-written compilers or assembly code due to a
lower number of target specific optimizations. This is not surprising, since such optimizations would be
counterproductive for the flexibility required to adapt quickly to architectural alternatives. As
has been observed in the code quality analysis of the ST220 compilers, the generated compilers
must be manually refined with dedicated optimizations once the ASIP architecture exploration
phase has converged and an initial working compiler is available. Hence, the second part of this
thesis focuses on target processor classes which, due to their architectural features, demand
specific code optimization techniques. Two promising architectural classes are selected, namely
processors equipped with SIMD instructions and those with Predicated Execution support.
This thesis implements these specific techniques such that retargetability within the given pro-
cessor class is achieved. The SIMD optimization was retargeted to two embedded processor
architectures with SIMD support. In general, the optimization achieves speedups of 7% to 66%
and code size reductions of up to 40% in most cases. The Predicated Execution optimization was
retargeted to three contemporary processors. On average, it achieves a cycle count improvement
of 39% and a code size reduction of 3%.
In this way, a complete and retargetable path from a single LISA processor model to a SIMD and
Predicated Execution enabled compiler for efficient compiler-in-the-loop architecture exploration
is achieved (Figure 11.1). Furthermore, to ease the manual creation of dedicated optimizations
on the assembly level, this thesis implements a new retargetable assembler which provides
an interface for code optimizations. A scheduler and peephole optimizer are implemented as
demonstrators.
Figure 11.1: Retargetable code generation based on LISA
Future research aims in different directions. Tomorrow’s SoC designs are heading towards
heterogeneous multiprocessor systems (MP-SoC). Additionally, there is an increasing number of embedded
processor architectures which are capable of executing multiple threads of control in parallel.
Apart from the general problem of identifying those parts of a sequential code like C which can
be executed in parallel, there is ongoing work to extend retargetable compilers in such a way that
all optimizations perform equally well on sequential as well as parallel code constructs in a
multi-threaded environment. Another recent trend in embedded processor design is a clustered
VLIW organization. Compilers for such architectures must find a cluster assignment so that a
good workload balance is achieved while keeping the communication costs between the clusters
low. Developing retargetable techniques to support the efficient exploration of such architectures
is an interesting topic. Future research also aims at finding new methodologies for DAG based
code selection. This enables the direct exploitation of inherently parallel hardware instructions,
which are a very common extension of ASIP processors, by compilers. Another topic is the iden-
tification of those data flow trees or graphs which actually could be promising candidates to be
implemented in hardware.
Appendix A
Semantics Section
The SEMANTICS section of the LISA language provides a simple, straight-forward syntax, which
allows the direct transformation of the instruction’s purpose into an as-short-as-possible semantical
description. The complete grammar specification is given in Section A.3. A SEMANTICS section
basically consists of one or more semantics statements.
A.1 Semantics Statements
Four kinds of semantics statements are available. The assignment statement is mainly used for
data computations, whereas the if-else statement models control flow in SEMANTICS sections.
The mode statement encapsulates the register and immediate operands of the micro-operations.
Everything else is covered by the non-assignment statement.
A.1.1 Mode Statements
Only certain processor resources are meaningful for the SEMANTICS section. The relevant resources
must usually be wrapped into LISA operations. The wrapping operation then defines the semantical
type of the respective resource. These semantical types are called modes. Two kinds of modes are
available, namely register and immediate.
• The register mode defines (allocatable) register resources as specified in the LISA resource
section. The bit-width of the resource should be specified by the general bit specifications
(Section A.2.8). The registers’ names are derived from the SYNTAX section.
• The immediate mode defines an immediate value which is part of the instruction coding.
The bit-width is therefore defined by the CODING section and hence, does not need to be
specified.
OPERATION imm8{
DECLARE{
LABEL value;
}
CODING{ value = 0bx[8] }
SEMANTICS{
IMMI(value);
} ...
}
OPERATION reg32{
DECLARE{
LABEL index;
}
SYNTAX { "R"~index }
SEMANTICS{
REGI(GPR[index])<0..31>;
} ...
}
Listing A.1: Mode statement examples
Listing A.1 provides an example for both modes. The operation imm8 wraps an 8-bit instruction
coding which can be used as immediate value. The bit-width information results from its CODING
section. The operation reg32 is a wrapper for the processor resource named GPR (i.e. the register
file). The label index is used to index the registers. The number of available registers is derived
from this label, e.g. a 4-bit binary number can index 16 registers. According to the SYNTAX
section of operation reg32 these registers are named R0 ... R15.
A.1.2 Assignment Statement
An assignment statement is a semantics statement of the form <source> -> <destination>. It
either performs a computation defined by micro-operators (Section A.2) and stores the result
in the destination, or simply moves the data from the source to the destination.
The source expression of the assignment statement must produce a result. For instance, the
user cannot put a NOP micro-operation on the left-hand side of the arrow, because it does
not produce any result. Moreover, the destination of an assignment statement cannot be an
arbitrary micro-operation expression. Only reasonable data sinks in an architecture can be used
as destination (e.g. status flags, registers).
An example is given in Listing A.2. In operation ADD the LISA groups src1, src2 and dst are
declared and used in the assignment. Each of these groups contains only one member, operation
reg32, whose SEMANTICS section has a REGI mode statement.
OPERATION ADD {
DECLARE{
GROUP src1 = { reg32 };
GROUP src2 = { reg32 };
GROUP dst = { reg32 };
} ...
SEMANTICS {
ADD(src1,src2) -> dst;
}
}
OPERATION reg32{
DECLARE{
LABEL index;
} ...
SEMANTICS {
REGI(GPR[index])<0..31>;
} ...
}
Listing A.2: Assignment statement examples
The bit-widths of both sides of an assignment statement must be the same. An
error is issued if they do not match. For resources, the bit-width can usually be extracted
from the LISA description, e.g. the RESOURCE section. However, the user can always specify the
bit-width explicitly. For micro-operators, the bit-width is either determined by the input operands
or given by the user. Please refer to Section A.2 for details on the width of each micro-operation's result.
Here are some more examples of assignment statements:
/* Rs1, Rs2, Dest are 32-bit registers declared in modes */
/* R16 is a 16-bit register */
ADD(Rs1, Rs2) -> Dest;
/* Correct, since ADD returns a 32-bit result and Dest is 32-bit */
MULUU(Rs1, R16) -> Dest;
/* Error!! since the MULUU returns a 48-bit result (32+16)
and the Dest is only 32-bit long. */
MULUU(Rs1, R16)<0..31> -> Dest;
/* Correct, since the bit-specifications are made. */
Listing A.3: Assignment statements and bit-width restrictions
For simple operations, like an addition for instance, a single statement should be sufficient, while
for more complex operations chaining or using if-else blocks may be necessary.
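The width rule illustrated by Listing A.3 can be sketched in a few lines of Python. This is only an illustration and not part of the LISA tools; the helper names (add_width, muluu_width, extract_width, check_assignment) are assumptions introduced here.

```python
def add_width(w1, w2):
    """_ADD requires equal operand widths and returns a result of that width."""
    if w1 != w2:
        raise ValueError("operand bit-width mismatch")
    return w1


def muluu_width(w1, w2):
    """_MULUU returns a result as wide as both operands together."""
    return w1 + w2


def extract_width(lo, hi):
    """Width selected by a bit specification <lo..hi>."""
    return hi - lo + 1


def check_assignment(src_width, dst_width):
    """An assignment is legal only if both sides have the same bit-width."""
    return src_width == dst_width


# The three statements of Listing A.3, with a 32-bit destination register
ok_add = check_assignment(add_width(32, 32), 32)
ok_mul = check_assignment(muluu_width(32, 16), 32)
ok_mul_spec = check_assignment(extract_width(0, 31), 32)
```

Under these assumptions, only the unrestricted multiplication is rejected, because its 48-bit result does not fit the 32-bit destination.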
A.1.3 IF-ELSE Statements
Similar to the C language’s if-else statement, the semantics if-else statement is used to model
conditional execution in the SEMANTICS section. Nested IF-ELSE statements are currently not
supported.
Ten predefined comparison micro-operators are available (Table A.1). Each of these comparison
operators returns either true or false, depending on the result. They can only be employed within
IF-ELSE conditions.
Keyword   Comparison
EQ        Equal
NE        Not equal
GTI       Signed greater than
GTU       Unsigned greater than
GEI       Signed greater than or equal
GEU       Unsigned greater than or equal
LTI       Signed less than
LTU       Unsigned less than
LEI       Signed less than or equal
LEU       Unsigned less than or equal
Table A.1: Comparison Keywords
Listing A.4 shows the SEMANTICS for an instruction which branches to the address in register
dst if the zero flag is set. The PC denotes the program counter resource.
OPERATION BRZ{
...
SEMANTICS {
IF ( EQ( ZF, 1)) {
dst -> PC;
}
}
}
Listing A.4: If-Else statement example
To form more complex conditions, conditions can be concatenated with ’||’ or ’&&’. As in the C
language, the former denotes the logical OR of the conditions on both of its sides, and the latter
a logical AND. Unlike in C, however, the two symbols have equal priority and the condition
expression is evaluated from left to right, i.e. the expressions next to the leftmost symbol are
evaluated first. Brackets can be used to override this order.
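The difference to the C precedence rules can be made concrete with a short Python sketch. It assumes the reading that a chain such as a || b && c is grouped as (a || b) && c; the function name eval_left_to_right is hypothetical and only serves this illustration.

```python
def eval_left_to_right(values, ops):
    """Evaluate values[0] ops[0] values[1] ops[1] ... strictly left to right,
    giving '||' and '&&' the same priority."""
    result = values[0]
    for op, value in zip(ops, values[1:]):
        if op == "||":
            result = result or value
        elif op == "&&":
            result = result and value
        else:
            raise ValueError("unknown operator: " + op)
    return result


# a || b && c with a = True, b = False, c = False
lisa_result = eval_left_to_right([True, False, False], ["||", "&&"])  # (a || b) && c
c_result = True or (False and False)                                   # a || (b && c)
```

The two results differ for this input, which is why brackets are advisable whenever both symbols appear in one condition.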
A.1.4 Non-Assignment Statements
The non-assignment statement covers all statements which do not carry out data assignments,
predicated execution or resource encapsulation. Such statements are commonly used in the LISA
operation hierarchy. The operation int_alu in Listing A.5 contains a SEMANTICS section using
the non-assignment statement opcode;. The statement consists of a single LISA group named opcode,
i.e. the semantics of this operation is provided by the selected member of this group.
OPERATION int_alu{
DECLARE{
GROUP opcode = { ADD || SUB };
} ...
SEMANTICS{ opcode; }
}
OPERATION ADD{
...
SEMANTICS{ ADD(rs1,rs2) -> dst; }
}
Listing A.5: Non-assignment statement example
A.2 Micro-operators
The micro-operators listed below form the basic set used for compiler generation. As stated in
Chapter 6, this set is designed to be concise and compact. However, it might be necessary to
extend it for other architectures. In particular, floating point support is entirely left out.
First, the notations used in the following sections are introduced. Afterwards, each
micro-operator is described in the style of an instruction-set manual. For certain cases, detailed
examples are provided. The micro-operators are grouped by functionality. Side-effects of the
micro-operators are modeled by affected flag declarations. These declarations and the general
bit specifications are explained later in this appendix.
A.2.1 Notations
Offset: Bit position indication. (The position starts from zero.)
Width: The width of the bit-extraction.
BITMASK(offset, width): Generates a bitmask in which the width bits starting at position offset
are set to 1 and all remaining bits are 0, e.g. BITMASK(3,4) = 0b01111000.
BIT_EXTRACTIONS(value, offset, width): (value) & BITMASK(offset, width)
CF: Carry flag
ZF: Zero flag
OF: Overflow flag
NF: Negative flag
CF_SET: Returns 1 if the carry flag is declared as an affected flag (side-effect), otherwise 0.
ZF_SET: Returns 1 if the zero flag is declared as an affected flag (side-effect), otherwise 0.
OF_SET: Returns 1 if the overflow flag is declared as an affected flag (side-effect), otherwise 0.
NF_SET: Returns 1 if the negative flag is declared as an affected flag (side-effect), otherwise 0.
operandn: Operands of the micro-operators. Each operand has three components: value, offset and
width. Value represents the actual content of the operand. Offset and width describe a bit extraction.
The operand finally used is formed by the bits extracted from the value. In the Operation part of
each micro-operator’s description, the index n is used to separate the components of different operands,
e.g. operand1 is composed of value1, width1 and offset1. Offset and width without an index refer to the
bit extraction applied to the micro-operator’s result.
ISSUE_ERROR(_MISMATCH): A mismatch error of the operands’ bit-widths is thrown.
ZF_SIDE_EFFECT(result): If the result is zero, returns 1. Otherwise returns 0.
NF_SIDE_EFFECT(result, width): If bit[width - 1] is 1 (negative value), returns 1. Otherwise
returns 0.
OF_SIDE_EFFECT_ADD(op1, op2, width): Returns 1 if the addition specified by its parameters
causes a (width)-bit signed overflow. An addition generates an overflow if both operands have the same
sign (bit[width - 1]) and the sign of the result differs from the sign of the operands.
CF_SIDE_EFFECT_ADD(op1, op2, width): Returns 1 if the addition specified by its parameters
causes a carry (the true result is bigger than 2^width - 1, where the operands are treated as unsigned
integers), and returns 0 in all other cases.
OF_SIDE_EFFECT_SUB(op1, op2, width): Returns 1 if the subtraction specified by its parameters
causes a (width)-bit signed overflow. A subtraction causes an overflow if the operands have different
signs, and the first operand and the result have different signs.
CF_SIDE_EFFECT_SUB(op1, op2, width): Returns 0 if the subtraction specified by its parameters
causes a borrow (the true result is less than 0, where the operands are treated as unsigned integers), and
returns 1 in all other cases.
OF_SIDE_EFFECT_MULUU(op1, op2, width1, width2, width): If the multiplication result of
the unsigned number op1 (width1) and the unsigned number op2 (width2) exceeds the unsigned range
that width bits can take, returns 1. Otherwise returns 0.
OF_SIDE_EFFECT_MULIU(op1, op2, width1, width2, width): If the multiplication result of
the signed number op1 (width1) and the unsigned number op2 (width2) exceeds the signed range that
width bits can take, returns 1. Otherwise returns 0.
OF_SIDE_EFFECT_MULII(op1, op2, width1, width2, width): If the multiplication result of the
signed number op1 (width1) and the signed number op2 (width2) exceeds the signed range that width
bits can take, returns 1. Otherwise returns 0.
OF_SIDE_EFFECT_NEG(result, width): The only case of overflow for the negation micro-operation
happens when the most negative value is taken as operand. For example, for 4-bit signed values, the
most negative value is 0b1000 (-8). Negating this value gives 0b1000 again, which is incorrect
because the largest positive value is +7. Returns 1 if this case happens, otherwise returns 0.
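The helper notations above translate almost directly into executable form. The following Python sketch is one possible reading of the definitions, written for illustration only; the lower-case function names and the auxiliary sign_bit are assumptions and do not appear in the LISA tools.

```python
def bitmask(offset, width):
    """BITMASK: `width` ones starting at bit position `offset`."""
    return ((1 << width) - 1) << offset


def bit_extractions(value, offset, width):
    """BIT_EXTRACTIONS: keep the selected bits in place (no shift)."""
    return value & bitmask(offset, width)


def sign_bit(value, width):
    """Auxiliary helper: bit[width - 1] of value."""
    return (value >> (width - 1)) & 1


def zf_side_effect(result):
    return 1 if result == 0 else 0


def nf_side_effect(result, width):
    return sign_bit(result, width)


def of_side_effect_add(op1, op2, width):
    # Overflow: equal operand signs, but a different result sign
    result = (op1 + op2) & bitmask(0, width)
    same_sign = sign_bit(op1, width) == sign_bit(op2, width)
    return 1 if same_sign and sign_bit(result, width) != sign_bit(op1, width) else 0


def cf_side_effect_add(op1, op2, width):
    # Carry: the true unsigned sum does not fit into `width` bits
    return 1 if op1 + op2 > (1 << width) - 1 else 0


def of_side_effect_sub(op1, op2, width):
    # Overflow: different operand signs, and the result sign differs from op1
    result = (op1 - op2) & bitmask(0, width)
    differ = sign_bit(op1, width) != sign_bit(op2, width)
    return 1 if differ and sign_bit(result, width) != sign_bit(op1, width) else 0


def cf_side_effect_sub(op1, op2, width):
    # Carry is the inverted borrow: 0 if the true unsigned result is negative
    return 0 if op1 < op2 else 1
```

The operands passed to the flag helpers are assumed to be already extracted, unsigned values of the given width, as in the Operation descriptions that follow.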
A.2.2 Group of arithmetic operators
This group of micro-operators deals with the arithmetic instructions that appear in most of the
processor architectures. Some of the micro-operators need to work with flags, reading flags and/or
writing flags as side-effects.
<arithmetic_uop> := _ADD | _ADDC | _SUB | _SUBC | _NEG | _MULUU | _MULIU | _MULII
ADD
Description Adds two operands
Syntax ADD[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags CF, ZF, NF, OF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS((temp1 + temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result);}
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (OF_SET) { OF = OF_SIDE_EFFECT_ADD(temp1, temp2, width1); }
if (CF_SET) { CF = CF_SIDE_EFFECT_ADD(temp1, temp2, width1); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_ADD[_C, _Z, _N, _O](0x00100010<0..31>, 0x00010001<0..31>)<0..31>: 110011 (ZF:0 NF:0 OF:0 CF:0)
_ADD[_C, _Z, _N, _O](0x00100010<0..15>, 0x00010001<0..15>)<0..15>: 11 (ZF:0 NF:0 OF:0 CF:0)
_ADD[_C, _Z, _N, _O](0x00108010<0..15>, 0x00010001<0..15>)<0..15>: 8011 (ZF:0 NF:1 OF:0 CF:0)
_ADD[_C, _Z, _N, _O](0x00100001<0..15>, 0x0000ffff<0..15>)<0..15>: 0 (ZF:1 NF:0 OF:0 CF:1)
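The examples above can be reproduced with a small executable model of the Operation description. The Python sketch below is an illustration under simplifying assumptions: all four flags are always computed, operands are passed as (value, lo, hi) triples, and the names extract and uop_add are hypothetical.

```python
def extract(value, lo, hi):
    """Return bits lo..hi of value shifted down to bit 0, and their width."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1), width


def uop_add(op1, op2, res=(0, 31)):
    """Model of _ADD[_C, _Z, _N, _O](op1, op2)<res>."""
    a, w1 = extract(*op1)
    b, w2 = extract(*op2)
    if w1 != w2:
        raise ValueError("operand bit-width mismatch")
    total = a + b
    result, width = extract(total, *res)

    def sign(v, w):
        return (v >> (w - 1)) & 1

    flags = {
        "ZF": int(result == 0),
        "NF": sign(result, width),
        # signed overflow: equal operand signs, different result sign
        "OF": int(sign(a, w1) == sign(b, w1) and sign(total, w1) != sign(a, w1)),
        # carry: the true sum exceeds the unsigned range of the operands
        "CF": int(total > (1 << w1) - 1),
    }
    return result, flags


# Last example above: 16-bit addition of 0x0001 and 0xffff
result, flags = uop_add((0x00100001, 0, 15), (0x0000FFFF, 0, 15), (0, 15))
```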
ADDC
Description Adds two operands with carry
Syntax ADDC[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags CF, ZF, NF, OF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS((temp1 + temp2 + CF), offset, width) >> offset;
if (OF_SET) { OF = OF_SIDE_EFFECT_ADD(temp1 + CF, temp2, width1); }
if (CF_SET) { CF = CF_SIDE_EFFECT_ADD(temp1 + CF, temp2, width1); }
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:1)
_ADDC[_C, _Z, _N, _O](0x00100010<0..31>, 0x00010001<0..31>)<0..31>: 110012 (ZF:0 NF:0 OF:0 CF:0)
_ADDC[_C, _Z, _N, _O](0x00100010<0..15>, 0x00010001<0..15>)<0..15>: 11 (ZF:0 NF:0 OF:0 CF:0)
_ADDC[_C, _Z, _N, _O](0x00108010<0..15>, 0x00010001<0..15>)<0..15>: 8011 (ZF:0 NF:1 OF:0 CF:0)
_ADDC[_C, _Z, _N, _O](0x00100001<0..15>, 0x0000ffff<0..15>)<0..15>: 0 (ZF:1 NF:0 OF:0 CF:1)
SUB
Description Subtracts the operand2 from operand1
Syntax SUB[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags CF, ZF, NF, OF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS((temp1 - temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (OF_SET) { OF = OF_SIDE_EFFECT_SUB(temp1, temp2, width1); }
if (CF_SET) { CF = CF_SIDE_EFFECT_SUB(temp1, temp2, width1); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_SUB[_C, _Z, _N, _O](0x00100010<0..31>, 0x00010001<0..31>)<0..31>: f000f (ZF:0 NF:0 OF:0 CF:1)
_SUB[_C, _Z, _N, _O](0x00100010<0..15>, 0x00010001<0..15>)<0..15>: f (ZF:0 NF:0 OF:0 CF:1)
_SUB[_C, _Z, _N, _O](0x00108010<0..15>, 0x00010001<0..15>)<0..15>: 800f (ZF:0 NF:1 OF:0 CF:1)
_SUB[_C, _Z, _N, _O](0x00100001<0..15>, 0x0000ffff<0..15>)<0..15>: 2 (ZF:0 NF:0 OF:0 CF:0)
_SUB[_C, _Z, _N, _O](0x00100001<0..31>, 0x0000ffff<0..31>)<0..31>: f0002 (ZF:0 NF:0 OF:0 CF:1)
SUBC
Description Subtracts the operand2 from operand1 with carry
Syntax SUBC[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags CF, ZF, NF, OF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
// temp1 - temp2 - NOT(CF)
result = BIT_EXTRACTIONS((temp1 - temp2 - NOT(CF) ), offset, width) >> offset;
if (OF_SET) { OF = OF_SIDE_EFFECT_SUB(temp1 - NOT(CF), temp2, width1); }
if (CF_SET) { CF = CF_SIDE_EFFECT_SUB(temp1 - NOT(CF), temp2, width1); }
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_SUBC[_C, _Z, _N, _O](0x00100010<0..31>, 0x00010001<0..31>)<0..31>: f000e (ZF:0 NF:0 OF:0 CF:1)
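Note that the carry flag acts as an inverted borrow in _SUB and _SUBC. The following Python sketch models both micro-operators on already extracted operands. It is a simplification made for illustration: the overflow flag is derived from the operand and result signs only, and the name uop_subc and its carry_in parameter are assumptions.

```python
def uop_subc(a, b, width, carry_in=1):
    """Subtract b from a on `width` bits.

    carry_in = 1 models plain _SUB. _SUBC passes the current CF, so that
    NOT(CF) is subtracted as an incoming borrow.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    borrow_in = 1 - carry_in
    true_result = a - b - borrow_in      # may be negative
    result = true_result & mask          # wrap to `width` bits

    def sign(v):
        return (v >> (width - 1)) & 1

    flags = {
        "ZF": int(result == 0),
        "NF": sign(result),
        "OF": int(sign(a) != sign(b) and sign(result) != sign(a)),
        "CF": int(true_result >= 0),     # 0 signals a borrow
    }
    return result, flags


sub_result = uop_subc(0x0001, 0xFFFF, 16)                        # borrow occurs
subc_result = uop_subc(0x00100010, 0x00010001, 32, carry_in=0)   # CF was 0
```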
MULUU
Description Multiplies the unsigned integer operand1 by unsigned integer operand2
Syntax MULUU[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions on the bit-widths of operands.
Result bitwidth The sum of the bit-widths of the operands.
Affected Flags ZF, OF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS(( (unsigned)temp1 * (unsigned)temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (OF_SET) { OF = OF_SIDE_EFFECT_MULUU(temp1, temp2, width1, width2, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_MULUU[_Z, _O](0x00100010<0..31>, 0x00010001<0..31>)<0..31>: 200010 (ZF:0 NF:0 OF:1 CF:0)
MULIU
Description Multiplies the signed integer operand1 by unsigned integer operand2
Syntax MULIU[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions on the bit-widths of operands.
Result bitwidth The sum of the bit-widths of the operands.
Affected Flags ZF, NF, OF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
// check if op1 is negative
// if so, sign extends to 32 bit long (only for this program)
// if long is used, then replace 32 by 64
temp1 = SEM_SXT(temp1, 0, width1, 0, 32);
result = BIT_EXTRACTIONS(( (signed)temp1 * (unsigned)temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (OF_SET) { OF = OF_SIDE_EFFECT_MULIU(temp1, temp2, width1, width2, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_MULIU[_Z, _N, _O](0x8000<0..15>, 0x0010<0..15>)<0..31>: fff80000 (ZF:0 NF:1 OF:0 CF:0)
MULII
Description Multiplies the signed integer operand1 by signed integer operand2
Syntax MULII[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions on the bit-widths of operands.
Result bitwidth The sum of the bit-widths of the operands.
Affected Flags ZF, NF, OF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
// check if op1 and op2 are negative
// if so, sign extends to 32 bit long (only for this program)
// if long is used, then replace 32 by 64
temp1 = SEM_SXT(temp1, 0, width1, 0, 32);
temp2 = SEM_SXT(temp2, 0, width2, 0, 32);
result = BIT_EXTRACTIONS(( (signed)temp1 * (signed)temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (OF_SET) { OF = OF_SIDE_EFFECT_MULII(temp1, temp2, width1, width2, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_MULII[_Z, _N, _O](0x8000<0..15>, 0x8010<0..15>)<0..31>: 3ff80000 (ZF:0 NF:0 OF:0 CF:0)
_MULII[_Z, _N, _O](0x8000<0..15>, 0x8010<0..15>)<0..23>: f80000 (ZF:0 NF:1 OF:1 CF:0)
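The effect of the result bit extraction on the flags of a signed multiplication can be checked with a short Python model. It is a sketch for illustration: Python integers are unbounded, so no fixed 32-bit intermediate is needed, and the names sign_extend and uop_mulii are assumptions.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def uop_mulii(a, width1, b, width2, width):
    """Model of _MULII[_Z, _N, _O] with a result extraction <0..width-1>."""
    product = sign_extend(a, width1) * sign_extend(b, width2)
    result = product & ((1 << width) - 1)
    # overflow: the true product is outside the signed range of `width` bits
    fits = -(1 << (width - 1)) <= product <= (1 << (width - 1)) - 1
    flags = {
        "ZF": int(result == 0),
        "NF": (result >> (width - 1)) & 1,
        "OF": int(not fits),
    }
    return result, flags


full = uop_mulii(0x8000, 16, 0x8010, 16, 32)    # result kept at 32 bits
narrow = uop_mulii(0x8000, 16, 0x8010, 16, 24)  # result cut to 24 bits
```

Cutting the same product to 24 bits drops significant bits, which is what sets the overflow and negative flags in the second example.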
NEG
Description Produces the negative value of the operand (twos-complement).
Syntax NEG[affected flag declarations](operand1)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of the operand.
Affected Flags ZF, NF, OF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
result = BIT_EXTRACTIONS( (-((signed)temp1)) , offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (OF_SET) { OF = OF_SIDE_EFFECT_NEG(temp1, width1); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_NEG[_Z, _N, _O](0x10<0..31>)<0..31>: fffffff0 (ZF:0 NF:1 OF:0 CF:0)
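The single overflow case of the negation can be demonstrated with a minimal Python sketch. It assumes equal operand and result widths; the name uop_neg is hypothetical.

```python
def uop_neg(value, width):
    """Model of _NEG[_Z, _N, _O] on a `width`-bit two's-complement operand."""
    mask = (1 << width) - 1
    operand = value & mask
    result = (-operand) & mask
    most_negative = 1 << (width - 1)     # e.g. 0b1000 for 4 bits
    flags = {
        "ZF": int(result == 0),
        "NF": (result >> (width - 1)) & 1,
        "OF": int(operand == most_negative),
    }
    return result, flags


regular = uop_neg(0x10, 32)     # example from the text
overflow = uop_neg(0b1000, 4)   # -8 has no positive 4-bit counterpart
```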
A.2.3 Group of logic operators
This group of micro-operators deals with the bitwise logic functions. Similar to the arithmetic
group, the operators can change the flags as a side-effect.
<logic_uop> := _AND | _OR | _XOR | _NOT
AND
Description Performs a bitwise AND operation on operand1 and operand2
Syntax AND[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags ZF, NF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS((temp1 & temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_AND[_N, _Z](0x0fff0fff<0..31>, 0x000f000f<0..31>)<0..31>: f000f (ZF:0 NF:0 OF:0 CF:0)
_AND[_N, _Z](0x0ff00fff<0..31>, 0x000f000f<0..31>)<0..31>: f (ZF:0 NF:0 OF:0 CF:0)
_AND[_N, _Z](0xfff00fff<0..31>, 0x000f000f<0..31>)<0..31>: f (ZF:0 NF:0 OF:0 CF:0)
_AND[_N, _Z](0xfff00fff<0..31>, 0x800f000f<0..31>)<0..31>: 8000000f (ZF:0 NF:1 OF:0 CF:0)
OR
Description Performs a bitwise OR operation on operand1 and operand2
Syntax OR[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags ZF, NF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS((temp1 | temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_OR[_N, _Z](0x0fff0fff<0..31>, 0x000f000f<0..31>)<0..31>: fff0fff (ZF:0 NF:0 OF:0 CF:0)
_OR[_N, _Z](0x0ff00fff<0..31>, 0x000f000f<0..31>)<0..31>: fff0fff (ZF:0 NF:0 OF:0 CF:0)
_OR[_N, _Z](0xfff00fff<0..31>, 0x000f000f<0..31>)<0..31>: ffff0fff (ZF:0 NF:1 OF:0 CF:0)
_OR[_N, _Z](0xfff00fff<0..31>, 0x800f000f<0..31>)<0..31>: ffff0fff (ZF:0 NF:1 OF:0 CF:0)
XOR
Description Performs a bitwise XOR operation on operand1 and operand2
Syntax XOR[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions Two operands must be of the same bitwidth.
Result bitwidth Same as that of operands.
Affected Flags ZF, NF
Operation
if (width1 == width2) {
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS((temp1 ^ temp2), offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
}
else {
ISSUE_ERROR(_MISMATCH);
}
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_XOR[_N, _Z](0x0fff0fff<0..31>, 0x000f000f<0..31>)<0..31>: ff00ff0 (ZF:0 NF:0 OF:0 CF:0)
_XOR[_N, _Z](0x0ff00fff<0..31>, 0x000f000f<0..31>)<0..31>: fff0ff0 (ZF:0 NF:0 OF:0 CF:0)
_XOR[_N, _Z](0xfff00fff<0..31>, 0x000f000f<0..31>)<0..31>: ffff0ff0 (ZF:0 NF:1 OF:0 CF:0)
_XOR[_N, _Z](0xfff00fff<0..31>, 0x800f000f<0..31>)<0..31>: 7fff0ff0 (ZF:0 NF:0 OF:0 CF:0)
NOT
Description Performs a bitwise NOT operation on operand
Syntax NOT[affected flag declarations](operand1)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of the operand.
Affected Flags ZF, NF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
result = BIT_EXTRACTIONS(~temp1, offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_NOT[_Z, _N](0x0fff0fff<0..31>)<0..31>: f000f000 (ZF:0 NF:1 OF:0 CF:0)
_NOT[_Z, _N](0x0ff00fff<0..31>)<0..31>: f00ff000 (ZF:0 NF:1 OF:0 CF:0)
_NOT[_Z, _N](0xfff00fff<0..31>)<0..31>: ff000 (ZF:0 NF:0 OF:0 CF:0)
_NOT[_Z, _N](0xfff00fff<0..31>)<0..31>: ff000 (ZF:0 NF:0 OF:0 CF:0)
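All four logic micro-operators follow the same pattern: compute the bitwise result, then derive the zero and negative flags from it. A compact Python sketch, assuming full-width operands and hypothetical names (LOGIC_UOPS, uop_logic, uop_not), could look as follows.

```python
import operator

# Two-operand logic micro-operators mapped to Python's bitwise functions
LOGIC_UOPS = {"_AND": operator.and_, "_OR": operator.or_, "_XOR": operator.xor}


def logic_flags(result, width):
    """Zero and negative flag derived from a `width`-bit result."""
    return {"ZF": int(result == 0), "NF": (result >> (width - 1)) & 1}


def uop_logic(name, a, b, width=32):
    """Model of _AND, _OR and _XOR on two operands of equal width."""
    mask = (1 << width) - 1
    result = LOGIC_UOPS[name](a & mask, b & mask)
    return result, logic_flags(result, width)


def uop_not(a, width=32):
    """Model of _NOT; the mask keeps the complement within `width` bits."""
    mask = (1 << width) - 1
    result = ~a & mask
    return result, logic_flags(result, width)


and_result = uop_logic("_AND", 0xFFF00FFF, 0x800F000F)
not_result = uop_not(0x0FFF0FFF)
```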
A.2.4 Group of shifting operators
This group of micro-operators deals with the shifting functionality. Again, the micro-operators
may affect the flags (mainly carry flag).
<shifting_uop> := _LSL | _LSR | _ASR | _ROTL | _ROTR
LSL
Description Performs a logical left shift operation on operand1 by operand2 bits. The
additional bits in dst are filled with zeros. The information in the operand2
leftmost bits is discarded if the user does not specify the affected flags.
Otherwise some flags (e.g. carry flag) are changed.
Syntax LSL[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of operand1.
Affected Flags CF, ZF, NF: if carry flag is specified in affected flags,
it is assumed that carry flag stores the last-moved bit
from the source. Zero and negative flag apply to the whole
value that is moved into destination.
Operation
temp1 = (unsigned)value1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
if ((unsigned)temp2 >= width1) { // matches the warning text; also safe for temp2 == 0
cerr << "Warning: left shift count >= width of type " << endl;
}
result = BIT_EXTRACTIONS(temp1 << (unsigned(temp2)), offset, width);
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (CF_SET) {
if (temp1 & (0x1 << (width1 - ((unsigned)temp2) ) ) ) { CF = 1; }
else { CF = 0; }
}
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_LSL[_C, _Z, _N](0x00ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00ff00 (ZF:0 NF:1 OF:0 CF:0)
_LSL[_C, _Z, _N](0x01ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00ff00 (ZF:0 NF:1 OF:0 CF:1)
_LSL[_C, _Z, _N](0x00ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00ff00 (ZF:0 NF:1 OF:0 CF:0)
_LSL[_C, _Z, _N](0x00ff00ff<0..31>, 0x10<0..31>)<0..31>: ff0000 (ZF:0 NF:0 OF:0 CF:1)
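The carry behaviour of the left shift, i.e. CF receiving the last bit shifted out, can be sketched in Python as follows. The model assumes a shift count between 1 and the operand width and equal operand and result widths; uop_lsl is a hypothetical name.

```python
def uop_lsl(value, shift, width=32):
    """Model of _LSL[_C, _Z, _N] for 0 < shift <= width."""
    if not 0 < shift <= width:
        raise ValueError("shift count outside the modeled range")
    mask = (1 << width) - 1
    value &= mask
    result = (value << shift) & mask
    flags = {
        "ZF": int(result == 0),
        "NF": (result >> (width - 1)) & 1,
        "CF": (value >> (width - shift)) & 1,   # last bit shifted out
    }
    return result, flags


by_8 = uop_lsl(0x01FF00FF, 0x8)
by_16 = uop_lsl(0x00FF00FF, 0x10)
```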
LSR
Description Performs a logical right shift on operand1 by operand2 bits. The new
operand2 bits to the left are filled with zeros. The information in the
operand2 rightmost bits is discarded if the user does not specify the affected
flags. Otherwise some flags (e.g. carry flag) are changed.
Syntax LSR[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of operand1.
Affected Flags CF, ZF, NF: if carry flag is specified in affected flags,
it is assumed that carry flag stores the last-moved bit
from the source. Zero and negative flag apply to the whole
value that is moved into destination.
Operation
temp1 = (unsigned)value1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = BIT_EXTRACTIONS(temp1 >> (unsigned(temp2)), offset, width);
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (CF_SET) {
if ((temp1 >> (unsigned(temp2) - 1)) & (0x1)) { CF = 1; }
else { CF = 0; }
}
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_LSR[_C, _Z, _N](0x00ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00 (ZF:0 NF:0 OF:0 CF:1)
_LSR[_C, _Z, _N](0x01ff00ff<0..31>, 0x8<0..31>)<0..31>: 1ff00 (ZF:0 NF:0 OF:0 CF:1)
_LSR[_C, _Z, _N](0x00ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00 (ZF:0 NF:0 OF:0 CF:1)
_LSR[_C, _Z, _N](0x00ff00ff<0..31>, 0x10<0..31>)<0..31>: ff (ZF:0 NF:0 OF:0 CF:0)
ASR
Description Performs an arithmetic right shift on operand1 by operand2 bits. The new
operand2 bits to the left are filled with zeros or ones depending on the
leftmost bit before the shift operation. The information in the operand2
rightmost bits is discarded if the user does not specify the affected flags.
Otherwise some flags (e.g. carry flag) are changed.
Syntax ASR[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of operand1.
Affected Flags CF, ZF, NF: if carry flag is specified in affected flags,
it is assumed that carry flag stores the last-moved bit
from the source. Zero and negative flag apply to the whole
value that is moved into destination.
Operation
temp1 = (unsigned)value1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
if ((temp1 & (0x1 << (width1 - 1)) ) == 0 ) {
result = (temp1 >> ((unsigned)temp2) );
}
else {
result = (temp1 >> ((unsigned)temp2) ) | BITMASK( (width1 - (unsigned)temp2), (unsigned)temp2);
}
result = BIT_EXTRACTIONS(result, offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
if (CF_SET) {
if ((temp1 >> (unsigned(temp2) - 1)) & (0x1)) { CF = 1; }
else { CF = 0; }
}
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_ASR[_C, _Z, _N](0x00ff00ff<0..23>, 0x8<0..31>)<0..23>: ffff00 (ZF:0 NF:1 OF:0 CF:1)
_ASR[_C, _Z, _N](0x01ff00ff<0..31>, 0x8<0..31>)<0..31>: 1ff00 (ZF:0 NF:0 OF:0 CF:1)
_ASR[_C, _Z, _N](0x80ff00ff<0..31>, 0x8<0..31>)<0..31>: ff80ff00 (ZF:0 NF:1 OF:0 CF:1)
_ASR[_C, _Z, _N](0x80ff00ff<0..31>, 0x10<0..31>)<0..31>: ffff80ff (ZF:0 NF:1 OF:0 CF:0)
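Logical and arithmetic right shifts differ only in how the vacated upper bits are filled. The Python sketch below models both with one function. It assumes a shift count between 1 and width - 1 and equal operand and result widths; the name uop_shift_right and its arithmetic parameter are assumptions.

```python
def uop_shift_right(value, shift, width=32, arithmetic=False):
    """Model of _LSR (arithmetic=False) and _ASR (arithmetic=True)."""
    if not 0 < shift < width:
        raise ValueError("shift count outside the modeled range")
    mask = (1 << width) - 1
    value &= mask
    result = value >> shift
    if arithmetic and (value >> (width - 1)) & 1:
        # negative operand: fill the vacated upper bits with ones
        result |= mask & ~(mask >> shift)
    flags = {
        "ZF": int(result == 0),
        "NF": (result >> (width - 1)) & 1,
        "CF": (value >> (shift - 1)) & 1,   # last bit shifted out
    }
    return result, flags


lsr = uop_shift_right(0x80FF00FF, 0x8)
asr = uop_shift_right(0x80FF00FF, 0x8, arithmetic=True)
asr24 = uop_shift_right(0x00FF00FF, 0x8, width=24, arithmetic=True)
```

The 24-bit case corresponds to the first _ASR example: bit 23 of the extracted operand is set, so the shifted value is filled with ones.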
ROTL
Description Rotational left shift on operand1 by operand2 bits. If the user specifies some
flags as side-effects (e.g. carry flag), the carry flag is used as a buffer to do
the shifting.
Syntax ROTL[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of operand1.
Affected Flags CF, ZF, NF: if carry flag is specified in affected flags,
it is assumed that carry flag stores the last-moved bit
from the source. Zero and negative flag apply to the whole
value that is moved into destination.
Operation
temp1 = (unsigned)value1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = temp1;
if (CF_SET) {
if (temp1 & (0x1 << (width1 - ((unsigned)temp2) ) ) ) { CF = 1; }
else { CF = 0; }
}
for (u32 i = 0; i < ((unsigned)temp2); i++) {
if (!(temp1 & (0x1 << (width1 - 1)) )) {
result = result << 1;
}
else {
result = (result << 1) | 1;
}
temp1 = temp1 << 1;
}
result = BIT_EXTRACTIONS(result, offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_ROTL[_C, _Z, _N](0x00ff00ff<0..23>, 0x8<0..31>)<0..23>: ffff (ZF:0 NF:0 OF:0 CF:1)
_ROTL[_C, _Z, _N](0x01ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00ff01 (ZF:0 NF:1 OF:0 CF:1)
_ROTL[_C, _Z, _N](0x80ff00ff<0..31>, 0x8<0..31>)<0..31>: ff00ff80 (ZF:0 NF:1 OF:0 CF:0)
_ROTL[_C, _Z, _N](0x80ff00ff<0..31>, 0x10<0..31>)<0..31>: ff80ff (ZF:0 NF:0 OF:0 CF:1)
ROTR
Description Rotational right shift on operand1 by operand2 bits. If the user specifies
some flags as side-effects (e.g. carry flag), the carry flag is used as a buffer
to do the shifting.
Syntax ROTR[affected flag declarations](operand1, operand2)[bit extractions]
Restrictions No restrictions.
Result bitwidth Same as that of operand1.
Affected Flags CF, ZF, NF: if carry flag is specified in affected flags,
it is assumed that carry flag stores the last-moved bit
from the source. Zero and negative flag apply to the whole
value that is moved into destination.
Operation
temp1 = (unsigned)value1;
temp2 = BIT_EXTRACTIONS(value2, offset2, width2) >> offset2;
result = temp1;
if (CF_SET) {
if ((temp1 >> (unsigned(temp2) - 1)) & (0x1)) { CF = 1; }
else { CF = 0; }
}
for (u32 i = 0; i < ((unsigned)temp2); i++) {
if ( !(temp1 & 1) ) {
result = result >> 1;
}
else {
result = (result >> 1) | (0x1 << (width1 - 1));
}
temp1 = temp1 >> 1;
}
result = BIT_EXTRACTIONS(result, offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_ROTR[_C, _Z, _N](0x00ff00ff<0..23>, 0x8<0..31>)<0..23>: ffff00 (ZF:0 NF:1 OF:0 CF:1)
_ROTR[_C, _Z, _N](0x01ff00ff<0..31>, 0x8<0..31>)<0..31>: ff01ff00 (ZF:0 NF:1 OF:0 CF:1)
_ROTR[_C, _Z, _N](0x80ff00ff<0..31>, 0x8<0..31>)<0..31>: ff80ff00 (ZF:0 NF:1 OF:0 CF:1)
_ROTR[_C, _Z, _N](0x80ff00ff<0..31>, 0x10<0..31>)<0..31>: ff80ff (ZF:0 NF:0 OF:0 CF:0)
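A corresponding Python sketch for the rotate-right case is given below. As before, this is an illustrative model under assumptions: the name `rotr`, the explicit `width` parameter, and the modulo reduction of the shift amount are not part of the original description.

```python
def rotr(value, amount, width):
    """Rotate the low `width` bits of `value` right by `amount` bits.

    Returns (result, cf, zf, nf). Assumption: `amount` is taken modulo `width`.
    """
    mask = (1 << width) - 1
    value &= mask
    amount %= width
    # Bits shifted out at the bottom re-enter at the top.
    result = ((value >> amount) | (value << (width - amount))) & mask
    # Carry holds the last bit moved out of the source: original bit (amount - 1).
    cf = (value >> (amount - 1)) & 1 if amount else 0
    zf = int(result == 0)                # zero flag over the whole result
    nf = (result >> (width - 1)) & 1     # negative flag is the result's MSB
    return result, cf, zf, nf
```

Under these assumptions the sketch reproduces the four example results listed above, including the cleared carry flag for the 16-bit rotation of `0x80ff00ff`.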
A.2.5 Group of zero/sign extension operators
This group of operators serves the purpose of zero/sign extension. They do not have any effect
on the flags.
<extension_uop> := _SXT | _ZXT
SXT
Description Performs a sign extension to the operand.
Syntax SXT(operand1)[bit extractions]
Restrictions No restrictions.
Result bitwidth The result bitwidth is determined by the bit-specs that follow the
micro-operator.
Affected Flags ZF, NF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
if ( width <= width1) {
cerr << "Warning: You are using a sign reduction in _SXT." << endl;
cerr << "Better directly use the bit specs." << endl;
}
if ( !(temp1 & (0x1 << (width1 - 1))) ) {
// MSB is 0
result = BIT_EXTRACTIONS(temp1, offset, width) >> offset;
}
else {
// MSB is 1
result = temp1 | BITMASK(width1, width - width1 );
result = BIT_EXTRACTIONS(result, offset, width) >> offset;
}
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
if (NF_SET) { NF = NF_SIDE_EFFECT(result, width); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_SXT[_Z, _N](0xff00<0..15>)<0..31>: ffffff00 (ZF:0 NF:1 OF:0 CF:0)
_SXT[_Z, _N](0x7f00<0..15>)<0..31>: 7f00 (ZF:0 NF:0 OF:0 CF:0)
_SXT[_Z, _N](0xff00<0..15>)<0..23>: ffff00 (ZF:0 NF:1 OF:0 CF:0)
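The sign extension can be modelled compactly in Python, whose unbounded integers make it easy to extend the sign bit "to infinity" and truncate afterwards. The function name `sxt` and its parameters `src_width` and `dst_width` are assumptions chosen for this sketch; they stand for the operand bit-width and the bit-width given by the bit specification.

```python
def sxt(value, src_width, dst_width):
    """Sign-extend a `src_width`-bit value and keep the low `dst_width` bits.

    Returns (result, zf, nf).
    """
    src_mask = (1 << src_width) - 1
    dst_mask = (1 << dst_width) - 1
    value &= src_mask
    if value & (1 << (src_width - 1)):
        # MSB is 1: set every bit above the source width (conceptually unbounded).
        value |= ~src_mask
    result = value & dst_mask            # truncation by the bit specification
    zf = int(result == 0)
    nf = (result >> (dst_width - 1)) & 1
    return result, zf, nf
```

For instance, `sxt(0xff00, 16, 24)` gives `0xffff00` with the negative flag set, which corresponds to the third example above.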
ZXT
Description Performs a zero extension to the operand.
Syntax ZXT(operand1)[bit extractions]
Restrictions No restrictions.
Result bitwidth The result bitwidth is determined by the bit-specs that follow the
micro-operator.
Affected Flags ZF
Operation
temp1 = BIT_EXTRACTIONS(value1, offset1, width1) >> offset1;
if ( width <= width1) {
cerr << "Warning: You are using a sign reduction _ZXT." << endl;
cerr << "Better directly use the bit specs." << endl;
}
result = BIT_EXTRACTIONS(temp1, offset, width) >> offset;
if (ZF_SET) { ZF = ZF_SIDE_EFFECT(result); }
return result;
Examples:
(ZF:0 NF:0 OF:0 CF:0)
_ZXT[_Z](0xff00<0..15>)<0..31>: ff00 (ZF:0 NF:0 OF:0 CF:0)
_ZXT[_Z](0x7f00<0..15>)<0..31>: 7f00 (ZF:0 NF:0 OF:0 CF:0)
_ZXT[_Z](0xff00<0..16>)<0..23>: ff00 (ZF:0 NF:0 OF:0 CF:0)
NOTE: The following suggestions concern writing sign/zero extensions and reductions in the
semantics section. If the user wants to perform a sign/zero extension, i.e. to expand the
bit-width of the operand taking the sign bit into account, it should be written as, e.g.
SXT( ADD(Rs1, Rs2)<0..7>)<0..15> -> Dest; /* Dest is 16 bit long. */.
This states that the lower 8 bits of the addition result are sign extended to 16 bits and then
transferred to the destination register (which must be 16 bits wide, otherwise errors are issued).
Alternatively, the result can be transferred to arbitrary bit locations of the destination register
as long as it makes sense, e.g.
SXT( ADD(Rs1, Rs2)<0..7>)<0..15> -> Dest<16..31>;
It is assumed that the micro-operators SXT and ZXT extend the operands to infinite length
and that the truncation is carried out by the bit-width specification, say, to 16 bits. The other case,
reduction, occurs in the ST220 model. Sign/zero reduction simply means extracting the lower
bits. The user may write something like
SXT( ADD(Rs1, Rs2)<0..15>)<0..7> -> Dest; /* Dest is 8 bit long. */
but that is equivalent to
ADD(Rs1, Rs2)<0..7> -> Dest; /* Dest is 8 bit long. */.
It is recommended that the user follow the latter expression. Warnings may be issued in this case.
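The equivalence of the two reduction forms can be checked with a short Python sketch. The helper names `bits`, `sxt`, and `reduction_is_redundant` are hypothetical and only model the notation used in this note: `bits(v, 0, 7)` stands for `v<0..7>`, and `sxt` follows the "extend to infinite length, then truncate" reading.

```python
def bits(value, lo, hi):
    """Model of value<lo..hi>: extract bits lo..hi (inclusive), shifted down to bit 0."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def sxt(value, src_width, dst_width):
    """Sign-extend a src_width-bit value without bound, then keep dst_width bits."""
    src_mask = (1 << src_width) - 1
    value &= src_mask
    if value & (1 << (src_width - 1)):
        value |= ~src_mask               # replicate the sign bit upwards
    return value & ((1 << dst_width) - 1)


def reduction_is_redundant(total):
    """True if SXT(total<0..15>)<0..7> equals total<0..7>."""
    return sxt(bits(total, 0, 15), 16, 8) == bits(total, 0, 7)
```

Because the target width is smaller than the source width, the replicated sign bits are all truncated again, so the check holds for every input value in this model.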
A.2.6 Others/Intrinsic operators
All the micro-operations that cannot be appropriately grouped in the above and the intrinsic
operations are listed here.
<other_uop> := _INDIR | _NOP | <intrinsic_uop>
INDIR
Description References the memory location pointed to by the operand. Can be used with
operation chaining for load and store operations, or for any other instructions
that use one or more memory operands.
Syntax INDIR( OR(Rs, SP))<Offset1..Offset1+Bits>;
Restrictions None.
Result bitwidth The bit-width of the result is determined by the bit-specs that follow the
micro-operator. Please refer to the details below.
Affected Flags None.
NOP
Description Do nothing.
Syntax NOP;
Restrictions None.
Result bitwidth None.
Affected Flags None.
<intrinsic op>
Description User-defined architecture-specific operations
Syntax "FFS";
Restrictions User-defined.
Result bitwidth User-defined.
Affected Flags User-defined for compiler knowledge.
NOTE: More details on the INDIR form and its parameters follow:
_INDIR(Addr, Endianess = _LITTLE, char *AddressNameSpace)<x..y>;
The INDIR can take up to three parameters for accessing memory. Addr is the address of
the memory location to be accessed. Endianess indicates which byte order
the micro-operation INDIR should follow. The address name space is relevant if the model has
multiple address spaces. The bit specification gives the access width, e.g. for loading a word from a byte-wise memory.
Examples:
INDIR(0x0, LITTLE, "DataMem")<0..31> -> Dest;
This operation will fill up the 32 bit destination register with the memory contents (memory
address space 1) {0x3}{0x2}{0x1}{0x0} provided that the base memory is byte-wise.
INDIR(0x0, BIG)<0..31> -> Dest;
This operation will fill up the 32 bit destination register with the memory contents (default
memory) {0x0}{0x1}{0x2}{0x3} provided that the base memory is byte-wise.
If there is only one bus in the LISA model, the AddressNameSpace can be omitted. The case
INDIR(0x0, LITTLE)<0..23> -> Dest;
is also valid, because the bits are simply counted when filling up the destination.
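The effect of the endianness parameter and of the bit specification can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the function name `indir`, the `width` parameter standing for the bit specification `<0..width-1>`, and a byte-wise memory in which the byte at address i holds the value i.

```python
def indir(memory, addr, endianness="little", width=32):
    """Model of INDIR(addr, endianness)<0..width-1> on a byte-wise memory."""
    nbytes = (width + 7) // 8            # count the bytes needed for `width` bits
    raw = bytes(memory[addr + i] for i in range(nbytes))
    word = int.from_bytes(raw, endianness)
    return word & ((1 << width) - 1)


# Hypothetical byte-wise memory: address i holds value i.
memory = [0x00, 0x01, 0x02, 0x03]
```

In this model `indir(memory, 0, "little", 32)` returns `0x03020100` and `indir(memory, 0, "big", 32)` returns `0x00010203`; a 24-bit little-endian access simply reads three bytes and returns `0x020100`.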
A.2.7 Affected flag declarations
Definitions of the flags
• Carry Flag: Set by a carry out/borrow in at MSB
• Zero Flag: Set if entire byte, word, or long == 0
• Negative Flag: Set if sign bit == 1
• Overflow Flag: Set by a carry into sign bit w/o a carry out
<affected_flag_declarations> := ’[’ <flag> {’,’ <flag>} ’]’
<flag> := _C | _Z | _N | _O
The affected flag declaration is essential for portraying the side effects of instructions that
occur in most processors. Here, side effects are defined as the post-effects of an instruction, i.e.,
the flags that change as a result of this instruction. (In contrast, the common addition with
carry is handled by different micro-operations.) Currently, four flags are explicitly
supported in this semantic description: carry flag, zero flag, negative flag, and overflow flag. For
example:
_ADD[_C, _Z](Rs1, Rs2) -> Rd;
This is interpreted as: use the pre-defined micro-operation ADD to add the two
operands and store the result in Rd. The zero flag is set if the result is zero and cleared otherwise.
The carry flag is set if a carry is generated and cleared otherwise.
NOTE: if the user does not give the affected flag declarations, no flags will be changed after the
operation.
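The role of the affected flag declaration can be made concrete with a Python sketch of an addition that only touches the declared flags. The function name `add_with_flags`, the `affected` tuple of flag letters, and the returned dictionary are assumptions for this sketch; the overflow flag is computed here with the usual signed-overflow test (both operands share a sign that the result does not have).

```python
def add_with_flags(a, b, width, affected=()):
    """Add two `width`-bit values and update only the declared flags."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    full = a + b
    result = full & mask
    sign = 1 << (width - 1)
    flags = {}                           # undeclared flags are left untouched
    if "C" in affected:
        flags["CF"] = int(full > mask)   # carry out of the MSB
    if "Z" in affected:
        flags["ZF"] = int(result == 0)
    if "N" in affected:
        flags["NF"] = int(bool(result & sign))
    if "O" in affected:
        # Signed overflow: operands have equal signs, result sign differs.
        flags["OF"] = int(bool(~(a ^ b) & (a ^ result) & sign))
    return result, flags
```

Calling `add_with_flags(0xffffffff, 1, 32, ("C", "Z"))` models `_ADD[_C, _Z](Rs1, Rs2)` and reports both flags as set, whereas a call without a flag declaration returns an empty flag update.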
A.2.8 General bit specifications
Bit specifications generally apply to all the micro-operations (and the conditions in semantics
section) to indicate that the result should be kept to several bits rather than the whole width.
bit_extractions := integer...integer
Examples:
SEMANTICS { Immediate16 -> Rd<Offset,16>; }
SEMANTICS {
_INDIR(_OR(Rs,SP))<Offset0..Offset0+Bits> -> Rd<Offset1..Offset1+Bits>;
}
SEMANTICS { _REGI(GPR[taskNo][index])<0..31>; }
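The two forms of bit specification used in the examples, a range `<lo..hi>` and an offset/width pair `<offset,width>`, can be modelled by a few Python helpers. The helper names are hypothetical; they only illustrate how a field is read from a source and written into part of a destination.

```python
def extract_range(value, lo, hi):
    """Model of value<lo..hi>: bits lo..hi (inclusive), shifted down to bit 0."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def extract_field(value, offset, width):
    """Model of value<offset,width>: `width` bits starting at bit `offset`."""
    return extract_range(value, offset, offset + width - 1)


def insert_range(dest, lo, hi, src):
    """Model of src -> dest<lo..hi>: overwrite bits lo..hi of dest with src."""
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (dest & ~mask) | ((src << lo) & mask)
```

For example, `insert_range(0, 16, 31, 0xff00)` places a 16-bit value in the upper half of a 32-bit destination, which mirrors an assignment to `Dest<16..31>`.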
A.3 SEMANTICS Section Grammar
A.3.1 Grammar Notation
The keywords are denoted by using bold font, e.g. ’KEYWORD’.
The nonterminal symbols are typeset slanted, e.g. ’nonterminal ’.
If the syntax definition contains special characters, they will be quoted with single quotes, e.g. ’}’.
Concatenation of two components is denoted by putting the components in sequence, e.g.
concatenation ::= element1 element2
Optional components are denoted by surrounding square brackets, e.g.:
optional ::= [ element ]
Repeating a component zero or more times is denoted with an asterisk, e.g.:
repeat ::= element*
Repeating a component one or more times is denoted with a plus, e.g.:
repeat ::= element+
Alternative components are denoted by vertical bars, e.g.:
alternative ::= option1 | option2 | option3
Brackets are used to group several elements, e.g.:
elements ::= ( element1 element2 )
Several elements separated with comma can use the same definition, e.g.:
element1, element2 ::= definition
A.3.2 SEMANTICS Grammar
Global Structure
semantics section ::= SEMANTICS ’{’ semantic statement+ ’}’
Semantic Statements
semantic statement ::= assignment statement
| if else statement
| modes statement
| non assignment statement
assignment statement ::= source expression ’->’ destination expression ’;’
source expression ::= micro operation expression
| integer
| LISA declared item
| semantics related resources
destination expression ::= LISA declared item
| indir expression
| semantics related resources
modes statement ::= regi mode | immi mode
regi mode ::= REGI ’(’ resource expression ’)’ ’<’ reg offset0 ’..’ reg offset1 ’>’ ’;’
resource expression ::= LISA declared item (’[’ LISA declared item ’]’)*
reg offset0, reg offset1 ::= integer
immi mode ::= IMMI ’(’ LISA declared item ’)’ ’;’
if else statement ::= IF ’(’ conditions ’)’
’{’ assignment statement+ ’}’
[ ELSE ’{’ assignment statement+ ’}’ ]
conditions ::= condition ( (’||’ | ’&&’) condition )*
| ’(’ conditions ’)’
condition ::= equal | not equal | signed greater | unsigned greater
| signed greater equal | unsigned greater equal
| signed less | unsigned less
| signed less equal | unsigned less equal
| CF | OF | NF | ZF
equal ::= EQ ’(’ compare operand ’,’ compare operand ’)’
not equal ::= NE ’(’ compare operand ’,’ compare operand ’)’
signed greater ::= GTI ’(’ compare operand ’,’ compare operand ’)’
unsigned greater ::= GTU ’(’ compare operand ’,’ compare operand ’)’
signed greater equal ::= GEI ’(’ compare operand ’,’ compare operand ’)’
unsigned greater equal ::= GEU ’(’ compare operand ’,’ compare operand ’)’
signed less ::= LTI ’(’ compare operand ’,’ compare operand ’)’
unsigned less ::= LTU ’(’ compare operand ’,’ compare operand ’)’
signed less equal ::= LEI ’(’ compare operand ’,’ compare operand ’)’
unsigned less equal ::= LEU ’(’ compare operand ’,’ compare operand ’)’
compare operand ::= micro operation expression
| integer
| LISA declared item
| semantics related resources
non assignment statement ::= micro operator ’;’
| micro operation expression ’;’
| LISA declared item ’;’
| semantics related resources ’;’
| integer ’;’
micro operator ::= ADD | ADDC | SUB | SUBC | MULII | MULIU
| MULUU | AND | OR | XOR | NOT | NEG | LSL
| LSR | ROTL | ROTR | ASR | ZXT | SXT
| ’”’intrinsic name’”’
Micro-operation Expressions
micro operation expression ::= add expression | addc expression | sub expression
| subc expression | mulii expression | muliu expression
| muluu expression | and expression | or expression
| xor expression | not expression | neg expression
| lsl expression | lsr expression | rotl expression
| rotr expression | asr expression | zxt expression
| sxt expression | indir expression | intrinsic expression
| hierarchy expression
add expression ::= ADD [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
addc expression ::= ADDC [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
sub expression ::= SUB [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
subc expression ::= SUBC [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
mulii expression ::= MULII [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
muliu expression ::= MULIU [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
muluu expression ::= MULUU [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
and expression ::= AND [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
or expression ::= OR [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
xor expression ::= XOR [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
lsl expression ::= LSL [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
lsr expression ::= LSR [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
rotl expression ::= ROTL [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
rotr expression ::= ROTR [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
asr expression ::= ASR [ affected flags ] ’(’ operand ’,’ operand ’)’ [ bit specification ]
not expression ::= NOT [ affected flags ] ’(’ operand ’)’ [ bit specification ]
neg expression ::= NEG [ affected flags ] ’(’ operand ’)’ [ bit specification ]
zxt expression ::= ZXT [ affected flags ] ’(’ operand ’)’ bit specification
sxt expression ::= SXT [ affected flags ] ’(’ operand ’)’ bit specification
indir expression ::= INDIR [ affected flags ] ’(’ operand [’,’ endianess ] [’,’ bus name] ’)’
bit specification
endianess ::= LITTLE | BIG
bus name ::= identifier
intrinsic expression ::= ’ ” ’intrinsic name’ ” ’ [ affected flags ] ’(’ [operand (’,’ operand)*] ’)’
[ bit specification ]
intrinsic name ::= ’_’ identifier
hierarchy expression ::= LISA declared item [ affected flags ] ’(’ [operand (’,’ operand)*] ’)’
[ bit specification ]
operand ::= micro operation expression
| semantics related resources
| LISA declared item
| integer
affected flags ::= ’|’ flag ( ’,’ flag )* ’|’
flag ::= C | O | N | Z
bit specification ::= ’<’ offset0 ’..’ offset1 ’>’
’<’ offset0 ’,’ width ’>’
offset0, offset1, width ::= integer | LISA declared item
Miscellaneous
semantics related resources ::= PC | SP | CF | OF | NF | ZF
LISA declared item ::= identifier [ bit specification ]
identifier ::= character ( character | figure | ’_’ )+
integer ::= figure+
figure ::= ’0’ | ’1’ | ’2’ | ’3’ | ’4’ | ’5’ | ’6’ | ’7’ | ’8’ | ’9’
character ::= ’a’ | ’b’ | ’c’ | ’d’ | ’e’ | ’f’ | ’g’ | ’h’ | ’i’ | ’j’ | ’k’ | ’l’ | ’m’
| ’n’ | ’o’ | ’p’ | ’q’ | ’r’ | ’s’ | ’t’ | ’u’ | ’v’ | ’w’ | ’x’ | ’y’ | ’z’
| ’A’ | ’B’ | ’C’ | ’D’ | ’E’ | ’F’ | ’G’ | ’H’ | ’I’ | ’J’ | ’K’ | ’L’ | ’M’
| ’N’ | ’O’ | ’P’ | ’Q’ | ’R’ | ’S’ | ’T’ | ’U’ | ’V’ | ’W’ | ’X’ | ’Y’ | ’Z’
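The lexical productions for identifier and integer translate directly into regular expressions. The Python sketch below is an illustration only; it assumes that the quoted blank in the identifier production stands for an underscore, and it takes the `+` in that production literally, so an identifier needs at least two characters. The names `is_identifier` and `is_integer` are hypothetical.

```python
import re

# identifier ::= character ( character | figure | '_' )+
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]+")
# integer ::= figure+
INTEGER = re.compile(r"[0-9]+")


def is_identifier(token):
    """True if the whole token matches the identifier production."""
    return IDENTIFIER.fullmatch(token) is not None


def is_integer(token):
    """True if the whole token matches the integer production."""
    return INTEGER.fullmatch(token) is not None
```

Under this literal reading, `Rs1` and `GPR_low` are identifiers, while a single letter or a hexadecimal literal such as `0x10` matches neither production.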
Appendix B
CoSy Compiler Library Grammar
This appendix contains the formal description of the LISA CoSy compiler library grammar.
B.1 Grammar Notation
The keywords are denoted by using bold font,
e.g. ’KEYWORD’.
The nonterminal symbols are typeset slanted,
e.g. ’nonterminal ’.
If the syntax definition contains special characters, they will be quoted with single quotes,
e.g. ’}’.
Concatenation of two components is denoted by putting the components in sequence,
e.g. concatenation ::= element1 element2
Optional components are denoted by surrounding square brackets,
e.g. optional ::= [ element ]
Repeating a component zero or more times is denoted with an asterisk,
e.g. repeat ::= element*
Repeating a component one or more times is denoted with a plus,
e.g. repeat ::= element+
Alternative components are denoted by vertical bars,
e.g. alternative ::= option1 | option2 | option3
Brackets are used to group several elements,
e.g. elements ::= ( element1 element2 )
Several elements separated with comma can use the same definition,
e.g. element1, element2 ::= definition
B.2 Global Structure
compiler library ::= basic rules [semantics transformations]
| [basic rules ] semantics transformations
basic rules ::= rule category basic rule*
rule category ::= CATEGORY category
category ::= ARITHMETIC | CONVERT | LOADSTORE
| MOVE | CONTROL | SPILL | CALLING
B.3 Basic Rules
basic rule ::= cosy ir [ basic rule condition ] [ cosy condition ] [ nonterminal constraint ]
[ control clause ] [ readwrite clause ] [ scratch registers ] [ semantics pattern ]
[ result clause ] [ node assignment ]
B.3.1 CoSy IR
cosy ir ::= COSYIR mir source expression [ ’->’ mir destination expression ]
mir source expression, mir destination expression ::= ccmir expression
| nonterminal expression
nonterminal expression ::= nonterminal placeholder
| spill nonterminal
nonterminal placeholder ::= [ SIGNED ] [ UNSIGNED ] [ IMMEDIATE ] [ REGISTER ]
[ ADDRESS ] [ CONDITION ] [MEMORY] placeholder name
placeholder name ::= identifier
spill nonterminal ::= Spill
ccmir expression ::= ccmir binary expression | ccmir unary expression | ccmir primary expression
ccmir binary expression ::= node name ’:’ binary node ’(’ mir operand ’,’ mir operand ’)’
binary node ::= mirPlus | mirMult | mirAnd | mirOr | mirXor | mirAddrPlus | mirDiv
| mirAddrDiff | mirDiff | mirShiftLeft | mirShiftRight | mirShiftRightSign
| mirAssign | mirCompare | mirReturn | mirMod | mirBitInsert
| mirBitExtract
ccmir unary expression ::= node name ’:’ unary node ’(’ mir operand ’)’
unary node ::= mirNot | mirNeg | mirConvert | mirContent | mirGoto | xirFuncCall
| mirCall | mirActual
ccmir primary expression ::= node name ’:’ primary node
primary node ::= mirObjectAddr | mirIntConst | mirNoExpr | mirAddrConst
| mirBoolConst | mirRealConst | mirNil
mir operand ::= ccmir expression
| nonterminal placeholder
node name ::= identifier
B.3.2 Rule Condition
basic rule condition ::= RULE COND rule conditions
rule conditions ::= type size compare((’||’ | ’&&’) type size compare )*
type size compare ::= type size ’==’ type size
| type size ’!=’ type size
| type size ’>’ type size
| type size ’>=’ type size
| type size ’<’ type size
| type size ’<=’ type size
type size ::= ’SIZEOF’ ’(’ target C data type ’)’
| ’SIZEOF’ ’(’ LARGEST IMM NT ’)’
target C data type ::= CHAR | SHORT | INT | LONG | POINTER
B.3.3 CoSy Condition
cosy condition ::= CONDITION ’{’ condition elements ’}’
condition elements ::= condition element ( ( ’||’ | ’&&’) condition element )*
condition element ::= [’!’] condition name ’(’ condition operands ’)’
| [’!’] condition operand
| [’!’] ’(’ condition elements ’)’
condition name ::= identifier
condition operands ::= condition operand ( ’,’ condition operand )*
condition operand ::= node name
| node name ’.’ node attribute name
| nonterminal size
| type size
| integer
nonterminal size ::= ’SIZEOF’ ’(’ placeholder name ’)’
node attribute name ::= identifier
B.3.4 Nonterminal Constraint
nonterminal constraint ::= NONTERMINAL CONSTRAINT
constraint ( ( ’||’ | ’&&’) constraint )*
constraint ::= nonterminal name ’==’ nonterminal name
| nonterminal name ’!=’ nonterminal name
| nonterminal size ’>=’ type size
| nonterminal size ’>’ type size
| nonterminal size ’==’ type size
nonterminal name ::= placeholder name
B.3.5 Control Clause
control clause ::= CONTROL control type
control type ::= call | branch | fallthrough
B.3.6 Read/Write Clause
readwrite clause ::= read clause [ write clause ] | [ read clause ] write clause
read clause ::= READ MEMORY ’;’
write clause ::= WRITE MEMORY ’;’
B.3.7 Scratch Registers
scratch registers ::= SCRATCH scratch name ( ’,’ scratch name )* ’;’
scratch name ::= identifier
B.3.8 Semantics Pattern
semantics pattern ::= PATTERN’{’ compiler semantics ’}’
B.3.9 Node Assignment
node assignment ::= ASSIGNMENT ’{’ assignment+ ’}’
assignment ::= destination node expression ’=’ source node expression ’;’
destination node expression ::= node name ’.’ node attribute name
node attribute name ::= identifier
source node expression ::= node name [ ’.’ node attribute name ]
| integer
B.3.10 Result Clause
result clause ::= RESULT nonterminal name
B.4 Semantics Transformations
semantics transformations ::= Transformations transformation+
transformation ::= semantics transform
| transformation function
semantics transform ::= ORIGINAL assignment statement
[scratch clause]
TRANSFORM ’{’ semantics statement+ ’}’
transformation function ::= TRANSFORMATION ’(’ integer ( ’,’ nonterminal placeholder )* ’)’
[scratch clause]
’{’ semantics statement+ ’}’
B.5 Compiler Semantics
compiler semantics ::= semantic statement+
semantic statement ::= assignment statement
| if else statement
| non assignment statement
| label statement
B.5.1 Assignment Statement
assignment statement ::= source expression ’->’ destination expression ’;’
source expression ::= micro operation expression [ ’<’ offset ’,’ width ’>’]
| uop operands [ ’<’ offset ’,’ width ’>’]
| constant expression
destination expression ::= uop operands [ ’<’ offset ’,’ width ’>’]
| indir expression [ ’<’ offset ’,’ width ’>’]
B.5.2 Label Statement
label statement ::= label name ’:’ [’<’ label width ’>’]
label name ::= ”LLabel ” integer
label width ::= integer
B.5.3 IF-ELSE Statement
if else statement ::= IF ’(’ conditions ’)’ ’{’ assignment statement+ ’}’
[ ELSE ’{’ assignment statement+ ’}’ ]
| IF ’(’ conditions ’)’
CONSTANT ASSIGNMENT ’(’ nonterminal name ’)’ ’;’
conditions ::= condition ( (’||’ | ’&&’) condition )*
| ’(’ conditions ’)’
condition ::= equal | not equal | signed greater | unsigned greater
| signed greater equal | unsigned greater equal
| signed less | unsigned less
| signed less equal | unsigned less equal
| CF | OF | NF | ZF
equal ::= EQ ’(’ compare operand ’,’ compare operand ’)’
not equal ::= NE ’(’ compare operand ’,’ compare operand ’)’
signed greater ::= GTI ’(’ compare operand ’,’ compare operand ’)’
unsigned greater ::= GTU ’(’ compare operand ’,’ compare operand ’)’
signed greater equal ::= GEI ’(’ compare operand ’,’ compare operand ’)’
unsigned greater equal ::= GEU ’(’ compare operand ’,’ compare operand ’)’
signed less ::= LTI ’(’ compare operand ’,’ compare operand ’)’
unsigned less ::= LTU ’(’ compare operand ’,’ compare operand ’)’
signed less equal ::= LEI ’(’ compare operand ’,’ compare operand ’)’
unsigned less equal ::= LEU ’(’ compare operand ’,’ compare operand ’)’
compare operand ::= micro operation expression [ ’<’ offset ’,’ width ’>’]
| uop operands [ ’<’ offset ’,’ width ’>’]
| constant expression
B.5.4 Non-Assignment Statement
non assignment statement ::= NOP ’;’
| TRANSFORMATION ’(’ integer ( ’,’ transform operand )* ’)’
transform operand ::= micro operation expression [ ’<’ offset ’,’ width ’>’]
| uop operands [ ’<’ offset ’,’ width ’>’]
| constant expression
B.5.5 Micro-Operation
micro operation expression ::= micro binary expressions
| micro unary expressions
| intrinsic expressions
micro binary expressions ::= binary operators [ affected flags ] ’(’ operand ’,’ operand ’)’
binary operators ::= ADD | ADDC | ASR | SUB | SUBC | MULII | MULIU
| MULUU | AND | OR | XOR | LSL | LSR | ROTL | ROTR
micro unary expressions ::= unary operators [ affected flags ] ’(’ operand ’)’
unary operators ::= NOT | NEG | SXT | ZXT | INDIR
intrinsic expression ::= ’ ” ’intrinsic name’ ” ’ [ affected flags ] ’(’ [operand (’,’ operand)*] ’)’
intrinsic name ::= ’_’ identifier
operand ::= micro operation expression [ ’<’ offset ’,’ width ’>’]
| uop operands [ ’<’ offset ’,’ width ’>’]
| constant expression
B.5.6 Operands
uop operands ::= REGISTER PC | FP | SP | CF | OF | NF | ZF
| nonterminal name ’.’ nonterminal attribute name
| nonterminal name
| nonterminal placeholder
| SYMBOL ’(’ symbol name ’)’
| label name
symbol name ::= ’_’ identifier
constant expression ::= nonterminal size
| calculation
calculation ::= calculation operand ( ( ’+’ | ’-’ | ’*’ | ’^’ ) calculation operand )*
calculation operand ::= integer
| type size
| ’(’ calculation ’)’
offset , width ::= constant expression
affected flags ::= ’|’ flag ( ’,’ flag )* ’|’
flag ::= C | O | N | Z
B.6 Miscellaneous
identifier ::= character ( character | figure | ’_’ )+
integer ::= figure+
figure ::= ’0’ | ’1’ | ’2’ | ’3’ | ’4’ | ’5’ | ’6’ | ’7’ | ’8’ | ’9’
character ::= ’a’ | ’b’ | ’c’ | ’d’ | ’e’ | ’f’ | ’g’ | ’h’ | ’i’ | ’j’ | ’k’ | ’l’ | ’m’
| ’n’ | ’o’ | ’p’ | ’q’ | ’r’ | ’s’ | ’t’ | ’u’ | ’v’ | ’w’ | ’x’ | ’y’ | ’z’
| ’A’ | ’B’ | ’C’ | ’D’ | ’E’ | ’F’ | ’G’ | ’H’ | ’I’ | ’J’ | ’K’ | ’L’ | ’M’
| ’N’ | ’O’ | ’P’ | ’Q’ | ’R’ | ’S’ | ’T’ | ’U’ | ’V’ | ’W’ | ’X’ | ’Y’ | ’Z’
Bibliography
[1] A. Aho and R. Sethi and J. Ullman. Compilers, Principles, Techniques and Tools. Addison-
Wesley, Jan. 1986. ISBN 0-2011-0088-6.
[2] A. Aho, M. Ganapathi, and S. Tjiang. Code generation using tree matching and dynamic
programming. ACM Transactions on Programming Languages and Systems, 11(4):491–516,
1989.
[3] A. Appel. Modern Compiler Implementation in C. Cambridge University Press, Jan. 1998.
ISBN 0-5215-8390-X.
[4] A. Appel, J. Davidson, and N. Ramsey. The Zephyr Compiler Infrastructure. Internal
report, University of Virginia, 1998. http://www.cs.virginia.edu/zephyr.
[5] A. Eichenberger, P. Wu and K. O’Brien. Vectorization for SIMD architectures with align-
ment constraints. In Proc. of the Int. Conf. on Programming Language Design and Imple-
mentation (PLDI), pages 82–93, 2004.
[6] A. Fauth. Beyond Tool-Specific Machine Descriptions. In P. Marwedel and G. Goossens,
editors, Code Generation for Embedded Processors. Kluwer Academic Publishers, 1995.
[7] A. Fauth and A. Knoll. Automated Generation of DSP Program Development Tools Using a
Machine Description Formalism. In Proc. of the Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), 1993.
[8] A. Fauth, J. Van Praet, and M. Freericks. Describing Instruction Set Processors Using nML.
In Proc. of the European Design and Test Conference (ED & TC), Mar. 1995.
[9] A. Gavrylenko. An Optimized Linear Scan Register Allocator for a Retargetable C-Compiler.
Master thesis, Software for Systems on Silicon, RWTH Aachen University, 2006. Advisor:
M. Hohenauer.
[10] A. Halambi, A. Shrivastava, N. Dutt, and A. Nicolau. A customizable compiler framework
for embedded systems. In Proc. of the Workshop on Software and Compilers for Embedded
Systems (SCOPES), Mar. 2001.
[11] A. Halambi, P. Grun, H. Tomiyama, N. Dutt, and A. Nicolau. Automatic Software Toolkit
Generation for Embedded System-on-Chip. In Proc. of the International Conference on
Visual Computing, Feb. 1999.
[12] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau. EXPRESSION:
A Language for Architecture Exploration through Compiler/Simulator Retargetability. In
Proc. of the Conference on Design, Automation & Test in Europe (DATE), Mar. 1999.
[13] A. Hoffmann, R. Leupers, and H. Meyr. Architecture Exploration for Embedded Processors
with LISA. Kluwer Academic Publishers, Boston, Jan. 2003. ISBN 1-4020-7338-0.
[14] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, and H.
Meyr. A Novel Methodology for the Design of Application Specific Instruction Set Processors
(ASIP) Using a Machine Description Language. IEEE Transactions on Computer-Aided
Design, 20(11):1338–1354, Nov. 2001.
[15] A. Hoffmann, T. Kogel, and H. Meyr. A Framework for Fast Hardware-Software Co-
simulation. In Proc. of the Conference on Design, Automation & Test in Europe (DATE),
Mar. 2001.
[16] A. Hoffmann, A. Nohl, G. Braun, and H. Meyr. Generating Production Quality Software
Development Tools Using A Machine Description Language. In Proc. of the Conference on
Design, Automation & Test in Europe (DATE), Mar. 2001.
[17] A. Inoue, H. Tomiyama, E.F. Nurprasetyo, and H. Yasuura. A Programming Language
for Processor Based Embedded Systems. In Proc. of the Asia Pacific Conference on Chip
Design Language (APCHDL), 1999.
[18] A. Inoue, H. Tomiyama, H. Okuma, H. Kanbara, and H. Yasuura. Language and compiler for
optimizing datapath widths of embedded systems. IEICE Transactions on Fundamentals,
12(E81-A):2595–2604, Dec. 1998.
[19] A. Jones, D. Bagchi, S. Pal, P. Banerjee, and A. Choudhary. PACT HDL: a compiler
targeting ASICS and FPGAS with power and performance optimizations. pages 169–190,
2002.
[20] A. Khare. SIMPRESS: A Simulator Generation Environment for System-on-Chip Explo-
ration. Technical report, Department of Information and Computer Science, University of
California, Irvine, Sep. 1999.
[21] A. Kitajima, M. Itoh, J. Sato, A. Shiomi, Y. Takeuchi, and M. Imai. Effectiveness of the
ASIP Design System PEAS-III in Design of Pipelined Processors. In Proc. of the Asia South
Pacific Design Automation Conference (ASPDAC), Jan. 2001.
[22] A. Krall and S. Lelait. Compilation Techniques for Multimedia Processors. Int. J. Parallel
Program., 28(4):347–361, 2000.
[23] A. Krall, I. Pryanishnikov, U. Hirnschrott, and C. Panis. xDSPcore: A Compiler-Based
Configurable Digital Signal Processor. IEEE Micro, 24(4):67–78, 2004.
[24] A. Kudriavtsev and P. Kogge. Generation of permutations for SIMD processors. In Proc.
of the Int. Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES).
[25] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, and H. Meyr. A Universal Technique for Fast
and Flexible Instruction-Set Architecture Simulation. In Proc. of the Design Automation
Conference (DAC), Jun. 2002.
[26] A. Orailoglu and A. Veidenbaum. Application Specific Microprocessors (Guest Editors’
Introduction). In IEEE Design & Test of Computers, Jan 2003.
[27] A. Terechko, E. Pol, and J. van Eijndhoven. PRMDL: a Machine Description Language for
Clustered VLIW Architectures. In Proc. of the Conference on Design, Automation & Test
in Europe (DATE), Mar. 2001.
[28] ACE – Associated Compiler Experts. CoSy System Documentation parts 1 to 5, 2005.
[29] ACE – Associated Computer Experts bv. SuperTest - Compiler Test and Validation Suite
http://www.ace.nl.
[30] ACE – Associated Computer Experts bv. The COSY Compiler Development System
http://www.ace.nl.
[31] Adelante Technologies. AR|T Builder
http://www.adelantetechnologies.com.
[32] Advanced Risc Machines Ltd. http://www.arm.com.
[33] Advanced Risc Machines Ltd. ARM9 and ARM11 Data Sheet, Dec. 1996.
[34] Analog Devices Inc. Analog Devices Homepage
http://www.analog.com.
[35] ARC International. ARCtangent Processor
http://www.arc.com.
[36] ARC International. ARC Programmers Reference Manual, Dec. 1999.
[37] B. Kernighan and D. Ritchie. The C Programming Language. Prentice Hall Software Series,
1988.
[38] B. Moszkowski and Z. Manna. Reasoning in interval temporal logic. In Logics of Programs:
Proceedings of the 1983 Workshop, pages 371–381. Springer-Verlag, 1984.
[39] B. Rau. Iterative modulo scheduling: an algorithm for software pipelining loops. In Proc. of
the International Symposium on Microarchitecture (MICRO), pages 63–74, New York, NY,
USA, 1994. ACM Press.
[40] B. Rau. VLIW Compilation driven by a machine description database. In Proc. of the 2nd
Code Generation Workshop, Leuven, Belgium, 1996.
[41] C. Fraser. A compact, machine-independent peephole optimizer. In Principles of Program-
ming Languages (POPL), pages 1–6, 1979.
[42] C. Fraser and D. Hanson. A Retargetable C Compiler : Design and Implementation. Ben-
jamin/Cummings Publishing Co., 1994.
[43] C. Fraser, D. Hanson and T. Proebsting. Engineering a simple, efficient code-generator
generator. ACM Letters on Programming Languages and Systems, 1(3):213–226, 1992.
[44] C. Fraser, R. Henry and T. Proebsting. BURG — fast optimal instruction selection and
tree parsing. ACM SIGPLAN Notices, 27(4):68–76, Apr. 1992.
[45] C. Liem, F. Breant, S. Jadhav, R. O’Farrell, R. Ryan, and O. Levia. Embedded Tools for
a Configurable and Customizable DSP Architecture. IEEE Design & Test of Computers,
19(6):27–35, 2002.
[46] C. Schumacher. Retargetable SIMD Optimization for Architectures with Multimedia In-
struction Sets. Diploma thesis, Software for Systems on Silicon, RWTH Aachen University,
2005. Advisor: M. Hohenauer.
[47] C. Siska. A Processor Description Language Supporting Retargetable Multi-Pipeline DSP
Program Development Tools. In Proc. of the Int. Symposium on System Synthesis (ISSS),
Dec. 1998.
[48] Center for Reliable and High-Performance Computing, University of Illinois. Illinois
Microarchitecture Project utilizing Advanced Compiler Technology (IMPACT).
http://www.crhc.uiuc.edu/IMPACT.
[49] Chunho Lee. MediaBench benchmark suite. http://euler.slu.edu/~fritts/mediabench/mb1.
[50] CoWare Inc. Processor Designer
http://www.coware.com.
[51] D. August. Hyperblock performance optimizations for ILP processors. M.S. thesis, Dept. of
Electrical and Computer Engineering, University of Illinois, Urbana, IL, 1996.
[52] D. August, W. Hwu and S. Mahlke. A framework for balancing control flow and predication.
In Proc. of the International Symposium on Microarchitecture (MICRO), 1997.
[53] D. Bradlee, R. Henry, and S. Eggers. The Marion System for Retargetable Instruction
Scheduling. In Proc. of the Int. Conf. on Programming Language Design and Implementation
(PLDI), pages 229–240, 1991.
[54] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Computer Architecture
News, 25(3):13–25, Jun. 1997.
[55] D. Fischer, J. Teich, M. Thies, and R. Weper. Efficient architecture/compiler co-exploration
for ASIPs. In Proc. of the Conference on Compilers, Architectures and Synthesis for Em-
bedded Systems (CASES), pages 27–34, 2002.
[56] D. Fischer, J. Teich, R. Weper, U. Kastens, and M. Thies. Design space characterization for
architecture/compiler co-exploration. In Proc. of the Conference on Compilers, Architectures
and Synthesis for Embedded Systems (CASES), pages 108–115, 2001.
[57] D. Genin, P. Hilfinger, J. Rabaey, C. Scheers, and H. De Man. DSP Specification using the
SILAGE language. In Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), pages 1057–1060, 1990.
[58] D. Kästner. Propan: A retargetable system for postpass optimisations and analyses. In
LCTES ’00: Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and
Tools for Embedded Systems, pages 63–80, 2001.
[59] D. Knuth. Semantics of context-free languages. Theory of Computing Systems, 2(2), June
1968.
[60] D. Landskov, S. Davidson, B. Shriver, and P. Mallett. Local microcode compaction
techniques. ACM Computing Surveys, 12(3):261–294, 1980.
[61] D. Lanneer, J. Van Praet, A. Kifli, K. Schoofs, W. Geurts, F. Thoen, and G. Goossens.
Chess: Retargetable Code Generation for Embedded DSP Processors. In P. Marwedel and
G. Goossens, editors, Code Generation for Embedded Processors. Kluwer Academic
Publishers, 1995.
[62] D. Maufroid, P. Paolucci et al. mAgic FPU: VLIW floating point engines for System-On-
Chip applications. In Proc. Emmsec Conference, 1999.
[63] D. Nuzman, I. Rosen and A. Zaks. Auto-vectorization of interleaved data for SIMD. In
Proc. of the Int. Conf. on Programming Language Design and Implementation (PLDI),
pages 132–143, 2006.
[64] Digital Equipment Corporation, Maynard, MA. Digital Semiconductor SA-110 Micropro-
cessor Technical Reference Manual, 1996.
[65] E. Dashofy, A. van der Hoek, and R. Taylor. A Highly-Extensible, XML-Based Architecture
Description Language. In WICSA ’01: Proceedings of the Working IEEE/IFIP Conference
on Software Architecture (WICSA’01), page 103, 2001.
[66] E. Rohou, F. Bodin, A. Seznec, G. Fol, F. Charot, and F. Raimbault. SALTO: System for
Assembly-Language Transformation and Optimization. Technical report, INRIA, national
institute for research in computer science and control, 1996.
[67] Edison Design Group. Compiler Front Ends for the OEM Market. http://www.edg.com.
[68] Embedded-C. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1169.pdf.
[69] F. Berens, G. Kreiselmaier, and N. Wehn. Channel Decoder Architecture for 3G Mobile
Wireless Terminals. In Proc. of the Conference on Design, Automation & Test in Europe
(DATE), 2004.
[70] F. Brandner, D. Ebner, and A. Krall. Compiler generation from structural architecture
descriptions. In Proc. of the Conference on Compilers, Architectures and Synthesis for
Embedded Systems (CASES), pages 13–22, 2007.
[71] F. Chow and J. Hennessy. Register allocation by priority-based coloring. ACM Letters on
Programming Languages and Systems, 19(6), 1984.
[72] F. Chow and J. Hennessy. The priority-based coloring approach to register allocation. ACM
Transactions on Programming Languages and Systems, 12(4):501–536, Oct. 1990.
[73] F. Engel. Interprocedural Pointer Alignment Analysis for a Retargetable C-Compiler with
SIMD Optimization. Diploma thesis, Software for Systems on Silicon, RWTH Aachen Uni-
versity, 2006. Advisor: M. Hohenauer.
[74] F. Franchetti, S. Kral, J. Lorenz, and C. Ueberhuber. Efficient utilization of SIMD extensions. In
Proceedings of the IEEE, volume 93, pages 409–425, 2005.
[75] F. Homewood and P. Faraboschi. ST200: A VLIW Architecture for Media-Oriented Appli-
cations. In Microprocessor Forum, Oct. 2000.
[76] F. Yang. ESP: A 10 year retrospective. In Proc. of the Embedded Systems Programming
Conference, 1999.
[77] Free Software Foundation. Auto-vectorization in GCC, 2004.
[78] Free Software Foundation. GNU Compiler Collection Homepage
http://gcc.gnu.org.
[79] G. Amdahl. Validity of the single-processor approach to achieving large-scale computer
capabilities. In AFIPS Conference Proc., volume 30, page 483, 1967.
[80] G. Bette. Retargetable Conditional Execution Support for CoSy Compilers. Diploma thesis,
Software for Systems on Silicon, RWTH Aachen University, 2007. Advisor: M. Hohenauer.
[81] G. Braun, A. Nohl, W. Sheng, J. Ceng, M. Hohenauer, H. Scharwächter, R. Leupers and
H. Meyr. A novel approach for flexible and consistent ADL-driven ASIP design. In Proc. of
the Design Automation Conference (DAC), pages 717–722, 2004.
[82] G. Chaitin. Register allocation and spilling via graph coloring. ACM SIGPLAN Notices,
17(6):98–105, Jun. 1982.
[83] G. Chaitin, M. Auslander, A. Chandra, J. Cocke, M. Hopkins, and P. Markstein.
Register allocation via coloring. Proc. of the International Conference on Computer
Languages (ICCL), 6(1):47–57, Jan. 1981.
[84] G. Cheong and M. Lam. An Optimizer for Multimedia Instruction Sets. In Proceedings of
the Second SUIF Compiler Workshop, Stanford University, USA, 1997.
[85] G. Hadjiyiannis, P. Russo, and S. Devadas. A Methodology for Accurate Performance
Evaluation in Architecture Exploration. In Proc. of the Design Automation Conference
(DAC), Jun. 1999.
[86] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL: An Instruction Set Description Language
for Retargetability. In Proc. of the Design Automation Conference (DAC), Jun. 1997.
[87] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL Language Reference Manual, Jan. 1997.
[88] G. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), 1965.
[89] G. Ren, P. Wu and D. Padua. A Preliminary Study on the Vectorization of Multimedia
Applications for Multimedia Extensions. In 16th International Workshop of Languages and
Compilers for Parallel Computing, October 2003.
[90] G. Ren, P. Wu and D. Padua. Optimizing data permutations for SIMD devices. In Proc.
of the Int. Conf. on Programming Language Design and Implementation (PLDI), pages
118–131, 2006.
[91] G. Smith. Crisis of complexity. In Gartner Dataquest briefing, 40th Design Automation
Conference (DAC), Jun. 2003.
[92] Gigascale Systems Research Center. Modern Embedded Systems: Compilers, Architectures,
and Languages. http://www.gigascale.org/mescal.
[93] GNU – Free Software Foundation. Bison - GNU Project
http://www.gnu.org/software/bison/bison.html.
[94] GNU – Free Software Foundation. Flex - GNU Project
http://www.gnu.org/software/flex/flex.html.
[95] H. Akaboshi. A Study on Design Support for Computer Architecture Design. PhD thesis,
Department of Information Systems, Kyushu University, Jan. 1996.
[96] H. Emmelmann, F. Schröer, and R. Landwehr. BEG — a generator for efficient back ends.
Proc. of the Int. Conf. on Programming Language Design and Implementation (PLDI),
24(7):227–237, Jul. 1989.
[97] H. Scharwaechter, D. Kammler, A. Wieferink, M. Hohenauer, K. Karuri, J. Ceng, R. Leupers,
G. Ascheid, and H. Meyr. ASIP architecture exploration for efficient IPSec encryption: A
case study. Trans. on Embedded Computing Sys., 6(2), 2007.
[98] H. Scharwaechter, R. Leupers, G. Ascheid, H. Meyr, J. Youn, and Y. Paek. A code-generator
generator for multi-output instructions. In Proc. of the Int. Conference on Hardware/Soft-
ware Co-design and System Synthesis (CODES+ISSS), Sep. 2007.
[99] H. Tomiyama, A. Halambi, P. Grun, N. Dutt, and A. Nicolau. Architecture Description
Languages for System-on-Chip Design. In Proc. of the Asia Pacific Conference on Chip
Design Language (APCHDL), Oct. 1999.
[100] H. Walters, J. Kamperman, and K. Dinesh. An extensible language for the generation of
parallel data manipulation and control packages, 1994.
[101] Hewlett-Packard. PA-RISC 1.1 Architecture and Instruction-set Reference Manual (Third
Edition), 1994.
[102] I. Huang and P. Xie. Application of instruction analysis/synthesis tools to x86’s functional
unit allocation. In Proc. of the Int. Symposium on System Synthesis (ISSS), Dec. 1998.
[103] I. Huang, B. Holmer, and A. Despain. ASIA: Automatic Synthesis of Instruction-set Archi-
tectures. In Proc. of the SASIMI Workshop, Oct. 1993.
[104] I. Pryanishnikov, A. Krall and N. Horspool. Pointer Alignment Analysis for Processors with
SIMD Instructions. In Proc. 5th Workshop on Media and Streaming Processors, 2003.
[105] IMEC. http://www.imec.be.
[106] Institute of Electrical and Electronics Engineers, Inc. (IEEE). IEEE Standard for Verilog
Hardware Description Language 2001.
[107] Institute of Electrical and Electronics Engineers, Inc. (IEEE). IEEE Standard VHDL Lan-
guage Reference Manual 2000.
[108] Intel Corporation. Intel C Compiler, http://www.intel.com.
[109] International Technology Roadmap for Semiconductors. SoC Design Cost Model - 2003
http://www.itrs.net.
[110] ISS RWTH Aachen University. The DSPstone benchmark suite.
http://www.ert.rwth-aachen.de/Projekte/Tools/DSPSTONE.
[111] J. Allen, K. Kennedy, C. Porterfield and J. Warren. Conversion of control dependence to
data dependence. In Principles of Programming Languages (POPL), 1983.
[112] J. Ceng, W. Sheng, M. Hohenauer, R. Leupers, G. Ascheid, H. Meyr, and G. Braun. Mod-
eling Instruction Semantics in ADL Processor Descriptions for C Compiler Retargeting.
Journal of VLSI Signal Processing Systems, 43(2-3):235–246, 2006.
[113] J. Davidson and C. Fraser. The design and application of a retargetable peephole optimizer.
ACM Transactions on Programming Languages and Systems, 2(2):191–202, 1980.
[114] J. Davidson and C. Fraser. Automatic generation of peephole optimizations. In Proc. of the
SIGPLAN Symposium on Compiler Construction, pages 111–116, 1984.
[115] J. Fisher. Trace scheduling: a technique for global microcode compaction. IEEE Transac-
tions on Computers, C-30(7):478–490, Jul. 1981.
[116] J. Fisher. Customized instruction-sets for embedded processors. In Proc. of the Design
Automation Conference (DAC), pages 253–257, 1999.
[117] J. Fisher, P. Faraboschi, and C. Young. Embedded Computing: A VLIW Approach to
Architecture, Compilers and Tools. Morgan Kaufmann, December 2004.
[118] J. Gyllenhaal, B. Rau and W. Hwu. Hmdes Version 2.0 Specification. Technical report,
IMPACT Research Group, Univ. of Illinois, 1996.
[119] J. Gyllenhaal, W. Hwu, and B. Rau. Optimization of Machine Descriptions for Efficient
Use. International Journal of Parallel Programming, Aug. 1998.
[120] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan
Kaufmann Publishers Inc., 1996. Second Edition.
[121] J. Paakki. Attribute Grammar Paradigms – A High-Level Methodology in Language Imple-
mentation. ACM Computing Surveys, 27(2), June 1995.
[122] J. Sato, A. Y. Alomary, Y. Honma, T. Nakata, A. Shiomi, N. Hikichi, and M. Imai. PEAS-
I: A Hardware/software Co-design System for ASIP Development. IEICE Transactions on
Fundamentals of Electronics, Communications and Computer Sciences, E77-A(3):483–491,
Mar. 1994.
[123] J. Sato, M. Imai, T. Nakata, A. Alomary, and N. Hikichi. An Integrated Design Environment
for Application-Specific Integrated Processors. In Proc. of the Int. Conf. on Computer
Design (ICCD), Mar. 1991.
[124] J. van Praet, G. Goossens, D. Lanneer, and H. De Man. Instruction Set Definition and
Instruction Selection for ASIPs. In Proc. of the Int. Symposium on System Synthesis (ISSS),
Oct. 1994.
[125] J. Wagner and R. Leupers. Advanced Code Generation for Network Processors with Bit
Packet Addressing. In Proc. of the Workshop on Network Processors (NP1), Feb. 2002.
[126] K. Bischoff. Design, Implementation, Use, and Evaluation of Ox: An Attribute-Grammar
Compiling System based on Yacc, Lex, and C. Technical Report 92-31, Department of
Computer Science, Iowa State University, Ames, 1992.
[127] K. Cooper, T. Harvey, and T. Waterman. Building a Control-Flow Graph from Scheduled
Assembly Code. Technical report, Dept. of Computer Science, Rice University.
[128] K. Diefendorff and P. Dubey. How Multimedia Workloads Will Change Processor Design.
Computer, 30(9):43–45, 1997.
[129] K. Hazelwood and T. Conte. A lightweight algorithm for dynamic if-conversion during
dynamic optimization. In Proc. of the Int. Conf. on Parallel Architectures and Compilation
Techniques (PACT), 2000.
[130] K. Karuri, M. Al Faruque, S. Kraemer, R. Leupers, G. Ascheid, and H. Meyr. Fine-grained
application source code profiling for ASIP design. In Proc. of the Design Automation Con-
ference (DAC), pages 329–334, 2005.
[131] K. Keutzer, S. Malik, A. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli. System-Level
Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on
Computer-Aided Design, 19(12):1523–1543, Dec. 2000.
[132] K. Olukotun, M. Heinrich, and D. Ofelt. Digital System Simulation: Methodologies and
Examples. In Proc. of the Design Automation Conference (DAC), Jun. 1998.
[133] K. Sagonas and E. Stenman. Experimental evaluation and improvements to linear scan
register allocation. Software, Practice and Experience, 33(11):1003–1034, 2003.
[134] K. van Berkel, F. Heinle, P. Meuwissen, K. Moerman, and M. Weiss. Processing as an
enabler for software-defined radio in handheld devices. In EURASIP Journal on Applied
Signal Processing, 2005.
[135] L. Chunho, M. Potkonjak, and W. Mangione-Smith. MediaBench: a tool for evaluating
and synthesizing multimedia and communications systems. In Proc. of the International
Symposium on Microarchitecture (MICRO), 1997.
[136] L. Guerra et al. Cycle and Phase Accurate DSP Modeling and Integration for HW/SW
Co-Verification. In Proc. of the Design Automation Conference (DAC), Jun. 1999.
[137] L. Carter, B. Simon, B. Calder, and J. Ferrante. Path Analysis and Renaming for Predicated
Instruction Scheduling. In International Journal of Parallel Programming, 2000.
[138] M. Bailey and J. Davidson. A Formal Model and Specification Language for Procedure
Calling Conventions.
[139] M. Barbacci. Instruction set processor specifications (ISPS): The notations and its applica-
tion. In IEEE Trans. Comput., pages 24–40, 1981.
[140] M. Benitez and J. Davidson. Target-specific global code improvement: Principles and ap-
plications. Technical report, Charlottesville, VA, USA, 1994.
[141] M. Ertl. Optimal Code Selection in DAGs. In Principles of Programming Languages
(POPL), 1999.
[142] M. Freericks. The nML Machine Description Formalism. Technical Report, Technical Uni-
versity of Berlin, Department of Computer Science, 1993.
[143] M. Freericks, A. Fauth and A. Knoll. Implementation of Complex DSP Systems Using
High-Level Design Tools. In Signal Processing VI: Theories and Applications, 1994.
[144] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-
Completeness. W.H. Freeman & Co, 1979. ISBN 0-7167-1045-5.
[145] M. Gries and K. Keutzer. Building ASIPs: The Mescal Methodology. Springer-Verlag, 2005.
[146] M. Hartoog, J. Rowson, P. Reddy, S. Desai, D. Dunlop, E. Harcourt, and N. Khullar.
Generation of Software Tools from Processor Descriptions for Hardware/Software Codesign.
In Proc. of the Design Automation Conference (DAC), Jun. 1997.
[147] M. Hohenauer, C. Schumacher, R. Leupers, G. Ascheid, H. Meyr, and H. v. Someren.
Retargetable code optimization with SIMD instructions. In Proc. of the Int. Conference on
Hardware/Software Co-design and System Synthesis (CODES+ISSS), pages 148–153, 2006.
[148] M. Hohenauer, F. Engel, R. Leupers, G. Ascheid, H. Meyr, G. Bette, and B. Singh.
Retargetable Code Optimization for Predicated Execution. In Proc. of the Conference on Design,
Automation & Test in Europe (DATE), 2008.
[149] M. Hohenauer, H. Scharwaechter, K. Karuri, O. Wahlen, T. Kogel, R. Leupers, G. Ascheid,
H. Meyr, G. Braun, and H. v. Someren. A Methodology and Tool Suite for C Compiler
Generation from ADL Processor Models. In Proc. of the Conference on Design, Automation
& Test in Europe (DATE), page 21276, 2004.
[150] M. Itoh, S. Higaki, J. Sato, A. Shiomi, Y. Takeuchi, A. Kitajima and M. Imai. PEAS-III:
An ASIP Design Environment. In Proc. of the Int. Conf. on Computer Design (ICCD), Sep.
2000.
[151] M. Itoh, Y. Takeuchi, M. Imai, and A. Shiomi. Synthesizable HDL Generation for Pipelined
Processors from a Micro-Operation Description. IEICE Transactions on Fundamentals of
Electronics, Communications and Computer Sciences, E83-A(3), Mar. 2000.
[152] M. Jain, M. Balakrishnan, and A. Kumar. ASIP Design Methodologies: Survey and Issues.
In Int. Conf. on VLSI Design, Jan. 2001.
[153] M. Kuulusa, J. Nurmi, J. Takala, P. Ojala, and H. Herranen. A Flexible DSP Core for
Embedded Systems. IEEE Design & Test of Computers, 14(4):60–68, 1997.
[154] M. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. Proc.
of the Int. Conf. on Programming Language Design and Implementation (PLDI), 23(7):318–
328, Jun. 1988.
[155] M. Poletto and V. Sarkar. Linear scan register allocation. ACM Transactions on Program-
ming Languages and Systems, 21(5):895–913, 1999.
[156] M. Smelyanskiy and S. Mahlke and E. Davidson and H. Lee. Predicate-aware scheduling:
a technique for reducing resource constraints. In Proc. of the Int. Conf. on Programming
Language Design and Implementation (PLDI), 2003.
[157] M. Vachharajani, N. Vachharajani, and D. August. The liberty structural specification
language: a high-level modeling language for component reuse. SIGPLAN Not., 39(6):195–
206, 2004.
[158] M. Vachharajani, N. Vachharajani, D. Penry, J. Blome, and D. August. Microarchitectural
exploration with Liberty. In Proc. of the International Symposium on Microarchitecture
(MICRO), pages 271–282. IEEE Computer Society Press, 2002.
[159] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1995.
[160] MIPS Technologies Inc. MIPS Homepage
http://www.mips.com.
[161] MIPS Technologies Inc. MIPS 4Kc Processor Core Datasheet, Jun. 2000.
[162] M. Naberezny. 6502 Homepage
http://www.6502.org.
[163] Motorola. DSP56K Manual, 1998.
[164] Motorola Inc. MPC750 RISC Microprocessor User’s Manual, 1997.
[165] N. Ramsey and J.W. Davidson. Machine Descriptions to Build Tools for Embedded Systems.
In Workshop on Languages, Compilers, and Tools for Embedded Systems, 1998.
[166] N. Ramsey and M. Fernandez. Specifying Representations of Machine Instructions. ACM
Transactions on Programming Languages and Systems, 19(3), Mar. 1997.
[167] N. Rizzolo and D. Padua. HiLO: High Level Optimization of FFTs. In Languages and
Compilers for High Performance Computing, volume 3602, 2005.
[168] N. Warter, D. Lavery, and W. Hwu. The benefit of predicated execution for software
pipelining. In Proceedings of the 26th Hawaii International Conference on System Sciences,
1993.
[169] NXP Semiconductors. Nexperia PNX 1500 family and TriMedia media processors.
http://www.nxp.com.
[170] O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, and H. Meyr. Architecture Implementation
Using the Machine Description Language LISA. In Proc. of the Asia South Pacific Design
Automation Conference (ASPDAC), page 239, 2002.
[171] O. Wahlen. C Compiler Aided Design of Application-Specific Instruction-Set Processors
Using the Machine Description Language LISA. PhD thesis, Institute for Integrated Signal
Processing Systems, RWTH Aachen University, Aachen, 2003.
[172] O. Wahlen, M. Hohenauer, R. Leupers, and H. Meyr. Instruction Scheduler Generation for
Retargetable Compilation. IEEE Design & Test of Computers, 20(1):34–41, 2003.
[173] Oberhumer.com GmbH. Lightweight Lempel-Ziv-Oberhumer (LZO), a lossless data com-
pression library. http://www.oberhumer.com/opensource/lzo.
[174] On Demand Microelectronics. http://www.ondemand.co.at.
[175] P. Anklam, Cutler, Heinen, and MacLaren. Engineering a Compiler: VAX-11 Code Gener-
ation and Optimization. Butterworth-Heinemann, Newton, MA, USA, 1982.
[176] P. Briggs, K. Cooper, and L. Torczon. Improvements to Graph Coloring Register
Allocation. ACM Transactions on Programming Languages and Systems, 16(3):428–455, May
1994.
[177] P. Chang, S. Mahlke, W. Chen, N. Warter, and W. Hwu. IMPACT: An Architectural
Framework for Multiple-Instruction-Issue Processors. ACM Computer Architecture News,
SIGARCH, 19(3):266–275, 1991.
[178] P. Grun, A. Halambi, A. Khare, V. Ganesh, N. Dutt, and A. Nicolau. EXPRESSION:
An ADL for System Level Design Exploration. Technical Report 98-29, Department of
Information and Computer Science, University of California, Irvine, Sep. 1998.
[179] P. Grun, A. Halambi, N. Dutt, and A. Nicolau. RTGEN: An Algorithm for Automatic
Generation of Reservation Tables from Architectural Descriptions. In Proc. of the Int.
Symposium on System Synthesis (ISSS), page 44, 1999.
[180] P. Marwedel and W. Schenk. Cooperation of synthesis, retargetable code generation and
test generation in the MSS. In Proc. of the Conference on Design, Automation & Test in
Europe (DATE).
[181] P. Mishra, N. Dutt and A. Nicolau. Functional abstraction driven design space exploration
of heterogeneous programmable architectures. In Proc. of the Int. Symposium on System
Synthesis (ISSS), pages 256–261, 2001.
[182] P. Paolucci, P. Kajfasz, P. Bonnot, B. Candaele, D. Maufroid, E. Pastorelli, A. Ricciardi,
Y. Fusella, and E. Guarino. mAgic-FPU and MADE: A customizable VLIW core and
the modular VLIW processor architecture description environment. In Computer Physics
Communications, volume 139, pages 132–143, 2001.
[183] P. Paulin. Towards Application-Specific Architecture Platforms: Embedded Systems
Design Automation Technologies. In Proc. of the EuroMicro, Apr. 2000.
[184] P. Paulin. Design Automation Challenges for Application-Specific Architecture Platforms.
Keynote speech at SCOPES 2001 - Workshop on Software and Compilers for Embedded
Systems (SCOPES), Apr. 2001.
[185] P. Paulin and M. Santana. Flexware: A retargetable embedded-software development envi-
ronment. IEEE Des. Test, 19(4):59–69, 2002.
[186] P. Paulin, C. Liem, T.C. May, and S. Sutarwala. FlexWare: A Flexible Firmware
Development Environment for Embedded Systems. In P. Marwedel and G. Goossens, editors, Code
Generation for Embedded Processors. Kluwer Academic Publishers, 1995.
[187] P. Paulin, F. Karim, and P. Bromley. Network Processors: A Perspective on Market Re-
quirements, Processor Architectures and Embedded SW Tools. In Proc. of the Conference
on Design, Automation & Test in Europe (DATE), Mar. 2001.
[188] P. Wu, A. Eichenberger and A. Wang. Efficient SIMD Code Generation for Runtime
Alignment and Length Conversion. In Proc. of the Int. Symposium on Code Generation and
Optimization (CGO), pages 153–164, 2005.
[189] R. Allen and K. Kennedy. Automatic translation of FORTRAN programs to vector form.
ACM Transactions on Programming Languages and Systems, 9(4):491–542, 1987.
[190] R. Allen, K. Kennedy and J. Allen. Optimizing Compilers for Modern Architectures: A
Dependence-based Approach. Morgan Kaufmann Publishers Inc., Oct. 2001. ISBN 1-5586-
0286-0.
[191] R. Gonzalez. Xtensa: A configurable and extensible processor. IEEE Micro, 20(2):60–70,
Mar. 2000.
[192] R. Hank, S. Mahlke, R. Bringmann, J. Gyllenhaal, and W. Hwu. Superblock Formation
Using Static Program Analysis. In Proc. of the 26th Symposium on Microarchitecture, pages
247–255, Dec. 1993.
[193] R. Krishnan. Future of Embedded Systems Technology. In BCC Research Group, Jun. 2005.
[194] R. L. Sites. Alpha Architecture Reference Manual. Digital Press, Burlington, MA, 1992.
[195] R. Leupers. Retargetable Code Generation for Digital Signal Processors. Kluwer Academic
Publishers, 1997.
[196] R. Leupers. Exploiting conditional instructions in code generation for embedded VLIW
processors. In Proc. of the Conference on Design, Automation & Test in Europe (DATE),
1999.
[197] R. Leupers. Code selection for media processors with SIMD instructions. In Proc. of the
Conference on Design, Automation & Test in Europe (DATE), pages 4–8, 2000.
[198] R. Leupers. LANCE: A C Compiler Platform for Embedded Processors. In Embedded
Systems/Embedded Intelligence. Feb. 2001. http://www.lancecompiler.com/.
[199] R. Leupers and P. Marwedel. Retargetable Generation of Code Selectors from HDL Pro-
cessor Models. In Proc. of the European Design and Test Conference (ED & TC), pages
140–144, 1997.
[200] R. Leupers and P. Marwedel. Retargetable Compiler Technology for Embedded Systems.
Kluwer Academic Publishers, Boston, Oct. 2001. ISBN 0-7923-7578-5.
[201] R. M. Senger, E. D. Marsman, M. S. McCorquodale, F. H. Gebara, K. L. Kraver, M. R. Guthaus,
and R. B. Brown. A 16-bit mixed-signal microsystem with integrated CMOS-MEMS clock
reference. In Proc. of the Design Automation Conference (DAC), pages 520–525, 2003.
[202] R. Milner, M. Tofte, and R. Harper. The definition of Standard ML. MIT Press, Cambridge,
MA, USA, 1990.
[203] R. Ravindran and R. Moona. Retargetable Cache Simulation Using High Level Processor
Models. In Proc. of the Computer Security Applications Conference (ACSAC), Mar. 2001.
[204] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. Rau, D. Cronquist, and M. Sivaraman.
PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators. Proc. of the
IEEE Workshop on VLSI Signal Processing, 31(2):127–142, 2002.
[205] R. Wilhelm and D. Maurer. Übersetzerbau. Theorie, Konstruktion, Generierung.
Springer-Verlag, Mar. 1997. ISBN 3-5406-1692-6.
[206] R. Woudsma. EPICS, a Flexible Approach to Embedded DSP Cores. In Proc. of the Int.
Conf. on Signal Processing Applications and Technology (ICSPAT), Oct. 1994.
[207] Renesas. http://eu.renesas.com.
[208] S. Abraham, W. Meleis, and I. Baev. Efficient Backtracking Instruction Schedulers. In Proc.
of the Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), pages 301–
308, May 2000.
[209] S. Aditya, V. Kathail and B. Rau. Elcor’s Machine Description System: Version 3.0. Tech-
nical Report, Hewlett-Packard Company, 1999.
[210] S. Bashford and R. Leupers. Constraint Driven Code Selection for Fixed-Point DSPs. In
Proc. of the Design Automation Conference (DAC), pages 817–822, 1999.
[211] S. Bashford, U. Bieker, B. Harking, R. Leupers, P. Marwedel, A. Neumann, and D. Vogge-
nauer. The MIMOLA Language, Version 4.1. Reference Manual, Department of Computer
Science 12, Embedded System Design and Didactics of Computer Science, 1994.
[212] S. Basu and R. Moona. High Level Synthesis from Sim-nML Processor Models. In Proc. of
the Int. Conf. on VLSI Design (VLSID), page 255. IEEE Computer Society, 2003.
[213] S. Farfeleder, A. Krall, E. Steiner, and F. Brandner. Effective compiler generation by
architecture description. In Proc. of the Int. Conf. on Languages, Compilers, and Tools for
Embedded Systems (LCTES), pages 145–152, 2006.
[214] S. Graham, P. Kessler, and M. McKusick. gprof: a Call Graph Execution Profiler. In ACM
SIGPLAN Symposium on Compiler Construction, pages 120–126, 1982.
[215] S. Hanono. Aviv: A Retargetable Code Generator for Embedded Processors. PhD thesis,
Massachusetts Institute of Technology, Jun. 1999.
[216] S. Kobayashi et al. Compiler Generation in PEAS-III: an ASIP Development System. In
Proc. of the Workshop on Software and Compilers for Embedded Systems (SCOPES), Mar.
2001.
[217] S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia
instruction sets. In Proc. of the Int. Conf. on Programming Language Design and Imple-
mentation (PLDI), pages 145–156, 2000.
[218] S. Larsen, E. Witchel and S. Amarasinghe. Increasing and Detecting Memory Address
Congruence. In Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques
(PACT), pages 18–29, 2002.
[219] S. Lavrov. Store economy in closed operator schemes. J. Comput. Math. Math. Phys.,
1(4):687–701, 1961.
[220] S. Mahlke, D. Lin, W. Chen, R. Hank and R. Bringmann. Effective compiler support for
predicated execution using the hyperblock. In Proc. of the International Symposium on
Microarchitecture (MICRO), 1992.
[221] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers
Inc., Aug. 1997.
[222] S. Onder and R. Gupta. Automatic Generation of Microarchitecture Simulators. In Proc.
of the International Conference on Computer Languages (ICCL), pages 80–89, May 1998.
[223] SimpleScalar LLC. http://www.simplescalar.com.
[224] SPAM Research Group. SPAM Compiler User’s Manual, Sep. 1997.
http://www.ee.princeton.edu/spam.
[225] SPARC International Inc. SPARC Homepage
http://www.sparc.com.
[226] Stanford University. SUIF Compiler System. http://suif.stanford.edu.
[227] Synopsys. http://www.synopsys.com.
[228] T. Glökler, S. Bitterlich and H. Meyr. ICORE: A Low-Power Application Specific Instruction
Set Processor for DVB-T Acquisition and Tracking. In Proc. of the ASIC/SOC conference,
Sep. 2000.
[229] T. Morimoto, K. Saito, H. Nakamura, T. Boku, and K. Nakazawa. Advanced processor
design using hardware description language AIDL. In Proc. of the Asia South Pacific Design
Automation Conference (ASPDAC), Mar. 1997.
[230] T. Morimoto, K. Yamazaki, H. Nakamura, T. Boku, and K. Nakazawa. Superscalar processor
design with hardware description language AIDL. In Proc. of the Asia Pacific Conference
on Chip Design Language (APCHDL), Oct. 1994.
[231] T. Proebsting and C. Fischer. Probabilistic register allocation. In Proc. of the Int. Conf.
on Programming Language Design and Implementation (PLDI), pages 300–310, 1992.
[232] Target Compiler Technologies. CHESS/CHECKERS
http://www.retarget.com.
[233] J. Teich and R. Weper. A Joined Architecture/Compiler Design Environment for ASIPs. In
Proc. of the Conference on Compilers, Architectures and Synthesis for Embedded Systems
(CASES), Nov. 2000.
[234] J. Teich, R. Weper, D. Fischer, and S. Trinkert. BUILDABONG: A Rapid Prototyping
Environment for ASIPs. In Proc. of the DSP Germany (DSPD), Oct. 2000.
[235] Tensilica Inc. Xtensa C compiler, http://www.tensilica.com.
[236] Texas Instruments. TMS320C54x CPU and Instruction Set Reference Guide, Oct. 1996.
[237] Texas Instruments Inc. Texas Instruments Homepage
http://www.texasinstruments.com.
[238] The Open Group. http://www.opengroup.org/architecture/adml/adml_home.htm.
[239] The Open SystemC Initiative (OSCI). Functional Specification for SystemC 2.0
http://www.systemc.org.
[240] Trimaran. An Infrastructure for Research in Instruction-Level Parallelism
http://www.trimaran.com.
[241] UDL/I Committee. UDL/I Language Reference Manual Version 2.1.0a, 1994.
[242] Underbit Technologies, Inc. MAD: A high-quality MPEG audio decoder.
http://www.underbit.com/.
[243] V. Kathail, S. Aditya, R. Schreiber, B. Rau, D. Cronquist, and M. Sivaraman. PICO:
Automatically Designing Custom Computers. Computer, 35(9):39–47, 2002.
[244] V. Kathail, M. Schlansker, and B. Rau. HPL-PD Architecture Specification: Version 1.0.
Technical Report, Hewlett-Packard Laboratories, HPL-93-80R1, 2000.
[245] V. Rajesh and R. Moona. Processor Modeling for Hardware Software Codesign. In Int.
Conf. on VLSI Design, Jan. 1999.
[246] V. Živojnović, J. M. Velarde, C. Schläger, and H. Meyr. DSPStone – A DSP-oriented
Benchmarking Methodology. In Int. Conf. on Signal Processing Applications and Technology
(ICSPAT), 1994.
[247] V. Živojnović, H. Schraut, M. Willems, and R. Schoenen. DSPs, GPPs, and Multimedia
Applications - An Evaluation Using DSPstone. In Proc. of the Int. Conf. on Signal Processing
Applications and Technology, Oct. 1995.
[248] V. Živojnović, S. Tjiang, and H. Meyr. Compiled simulation of programmable DSP
architectures. In Proc. of the IEEE Workshop on VLSI Signal Processing, Oct. 1995.
[249] W. Chuang, B. Calder and J. Ferrante. Phi-predication for light-weight if-conversion. In
Proc. of the Int. Conf. on Programming Language Design and Implementation (PLDI), 2003.
[250] W. Geurts et al. Design of DSP Systems with Chess/Checkers. In Proc. of 2nd Int. Workshop
on Code Generation for Embedded Processors, Mar. 1996.
[251] W. Mong and J. Zhu. A retargetable micro-architecture simulator. In Proc. of the Design
Automation Conference (DAC), pages 752–757, 2003.
[252] W. Qin and S. Malik. Flexible and Formal Modeling of Microprocessors with Application
to Retargetable Simulation. In Proc. of the Conference on Design, Automation & Test in
Europe (DATE), Mar. 2003.
[253] X. Nie, L. Gazsi, F. Engel and G. Fettweis. A New Network Processor Architecture for
High-Speed Communications. In Proc. of the IEEE Workshop on Signal Processing Systems
(SIPS), pages 548–557, Oct. 1999.
[254] X. Zhu, W. Qin, and S. Malik. Modeling operation and microarchitecture concurrency for
communication architectures with application to retargetable simulation. In Proc. of the
Int. Conference on Hardware/Software Co-design and System Synthesis (CODES+ISSS),
pages 66–71, 2004.
[255] Y. Bajot and H. Mehrez. Customizable DSP Architecture for ASIP Core Design. In Proc.
of the IEEE Int. Symposium on Circuits and Systems (ISCAS), May 2001.
[256] Y. Kim and T. Kim. A Design and Tools Reuse Methodology for Rapid Prototyping of
Application Specific Instruction Set Processors. In Proc. of the Workshop on Rapid System
Prototyping (RSP), Apr. 1999.
[257] Y. Kobayashi, S. Kobayashi, K. Okuda, K. Sakanushi, Y. Takeuchi, and M. Imai.
Synthesizable HDL generation method for configurable VLIW processors. In Proc. of the Asia South
Pacific Design Automation Conference (ASPDAC), pages 842–845, 2004.
[258] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K.
Flautner. SODA: A Low-power Architecture For Software Radio. In Computer Architecture,
2006. ISCA ’06. 33rd International Symposium on, pages 89–101, 2006.
Curriculum Vitae
Name: Manuel Hohenauer
Date of birth: 18 December 1974
Place of birth: Krefeld, Germany
Work Experience:
01/2002 – 08/2007 Research assistant at the department for Software for Systems
on Silicon, Prof. Dr. Rainer Leupers, RWTH Aachen
University
08/2000 – 12/2000 Research assistant at the Institute for Signal Processing Systems,
Prof. Dr. Heinrich Meyr, RWTH Aachen University
Education:
11/1994 – 06/2000 Study of Electrical Engineering and Computer Science
RWTH Aachen University
Diploma Thesis at the Institute for Integrated Signal Processing
Systems, RWTH Aachen University,
"Control-Dataflow-Graph based Bit-true C-Code Generation
and Code Optimization for the Digital Signal Processor
TMS320C62x"
Internships:
Energia San Juan, San Juan, Argentina
Philips GmbH, Tuner Factory, Krefeld, Germany
03/2000 Intermediate Diploma in Business Economics
04/1996 – 03/2000 Study of Business Economics, RWTH Aachen University
05/1994 Abitur (A-level Exam)
07/1985 – 05/1994 Gymnasium Thomaeum (grammar school), Kempen, Germany
Krefeld, March 2009
Manuel Hohenauer
