Structural optimization of numerical programs for high-level synthesis by Gao, Xitong
Imperial College London
Department of Electrical and Electronic Engineering
Circuits and Systems Research Group
Structural Optimization of Numerical Programs for
High-Level Synthesis
Xitong Gao
Supervisor George A. Constantinides
Submitted in part fulfilment of the requirements for the degree of Doctor of
Philosophy in Electrical and Electronic Engineering of Imperial College London
and the Diploma of Imperial College London
September 20, 2016
This thesis is my own work and all other related work is appropriately referenced.
The copyright of this thesis rests with the author and is made available under a Creative Commons
Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or
transmit the thesis on the condition that they attribute it, that they do not use it for commercial
purposes and that they do not alter, transform or build upon it. For any reuse or redistribution,
researchers must make clear to others the licence terms of this work.
Xitong Gao
Structural Optimization of Numerical Programs for High-Level Synthesis
September 20, 2016
Supervisor: George A. Constantinides
Imperial College London
Circuits and Systems Research Group
Department of Electrical and Electronic Engineering
South Kensington Campus
London SW7 2AZ
Abstract
This thesis introduces a new technique, and its associated tool SOAP, to automatically
perform source-to-source optimization of numerical programs, specifically targeting the
trade-off among numerical accuracy, latency, and resource usage as a high-level synthesis
flow for FPGA implementations. A new intermediate representation, MIR, is introduced to
carry out the abstraction and optimization of numerical programs. Equivalent structures
in MIRs are efficiently discovered using methods based on formal semantics by taking
into account axiomatic rules from real arithmetic, such as associativity, distributivity and
others, in tandem with program equivalence rules that enable control-flow restructuring
and eliminate redundant array accesses. For the first time, we bring rigorous approaches
from software static analysis, specifically formal semantics and abstract interpretation,
to bear on program transformation for high-level synthesis. New abstract semantics are
developed to generate a computable subset of equivalent MIRs from an original MIR.
Using formal semantics, three objectives are calculated for each MIR representing a
pipelined numerical program: the accuracy of computation, an estimate of FPGA resource
utilization, and the latency of program execution. The optimization of these
objectives produces a Pareto frontier consisting of a set of equivalent MIRs. We thus
go beyond existing literature by not only optimizing the precision requirements of an
implementation, but also changing the structure of the implementation itself. Using SOAP
to optimize the structure of a variety of real-world and artificially generated arithmetic
expressions in single precision, we improve either their accuracy or the resource utilization
by up to 60%. When applied to a suite of computationally intensive numerical programs
from the PolyBench and Livermore Loops benchmarks, SOAP has generated circuits that
enjoy up to a 12× speedup, with a simultaneous 7× increase in accuracy, at a cost of up





1.1 Thesis Organization
1.2 Contributions
1.3 Publications
2 Background
2.1 Field-Programmable Gate Arrays
2.1.1 FPGA Architecture
2.1.2 RTL Design Flow
2.2 High-Level Synthesis
2.2.1 HLS Design Flow
2.2.2 Loop Pipelining
2.2.3 Modulo SDC Scheduling
2.2.4 Obstacles in Adoption
2.3 Program Analysis and Abstract Interpretation
2.3.1 Data-Flow Analysis Framework
2.3.2 Least Fixpoint Solution to a Data-Flow Analysis Problem
2.3.3 Abstract Interpretation with Intervals
2.3.4 Abstract Domains
2.4 Intermediate Representations
2.4.1 Static Single Assignment Form and Control-Flow Graph
2.4.2 Equality Saturation
2.5 Discovering Equivalent Programs
2.5.1 Improving Performance by Rewriting Arithmetic Expressions
2.5.2 Rewriting Arithmetic Expressions for Accuracy
2.5.3 Numerical Programs
3 Structural Optimization of Arithmetic Expressions
3.1 Accuracy Analysis
3.2 Resource Usage Analysis
3.3 Equivalent Expressions Analysis
3.3.1 Discovering Equivalent Expressions
3.3.2 Scalable Methods for Rewriting
3.3.3 Pareto Frontier
3.3.4 Equivalent Expressions Semantics
3.3.5 Simultaneous Optimization of Multiple Expressions
3.4 Implementation
3.5 Results
3.6 Summary
4 Numerical Program Optimization
4.1 Syntax Definition
4.2 Metasemantic Intermediate Representation
4.2.1 Assignment statements
4.2.2 Sequential statements
4.2.3 Conditional Branches
4.2.4 Loops
4.2.5 Example analysis
4.3 Accuracy Analysis
4.3.1 MIR
4.3.2 Composition Operator
4.3.3 Ternary Conditional Operator
4.3.4 Fixpoint Operator
4.4 Resource Usage Analysis
4.4.1 Sharing conditional expressions
4.4.2 Resource sharing in composition expressions
4.5 Equivalent Structure Analysis
4.5.1 Equivalence Relations
4.5.2 Discovering Equivalent Structures Efficiently
4.6 Code Generation
4.7 Evaluation
4.7.1 Results
4.7.2 Quality of Resource Estimation
4.8 Summary
5 Accurate and Resource Efficient Pipelining of Numerical Programs
5.1 Motivation
5.2 Syntax Definition
5.3 Extending MIRs with Arrays
5.4 Structural Optimization
5.4.1 Improved Algorithm
5.4.2 Transformation Rules
5.5 Performance Analysis
5.5.1 Latency Analysis
5.5.2 Resource Utilization Analysis
5.5.3 Accuracy Analysis
5.6 Evaluation
5.6.1 Results
5.6.2 Discussion
5.7 Summary
6 Conclusion
6.1 Future Prospects
6.2 Final Remarks
Bibliography
Appendices
A Sound Acceleration of Equivalent Expression Discovery
B Formal Definitions of Equivalent MIR Discovery
C Benchmark Source Code
List of Figures
2.1 A high-level block diagram of an ALM in Stratix V, from Stratix V Device
Handbook [Alt15d].
2.2 Quartus II design flow.
2.3 The LegUp design flow, adapted from [Can+13] and [LU].
2.4 A simple dot-product example which calculates the dot-product of two arrays
A and B, each with 1024 elements.
2.5 The resulting schedule of the example program in generated.
2.6 The dependence graph formed by the data-dependences in the loop body
of the dot-product example in Figure 2.4. The dashed edge highlights the
inter-iteration dependence.
2.7 A simple program example to be statically analyzed.
2.8 The CDFG of simple in Figure 2.7.
2.9 The compiled and optimized LLIR output from the dot-product example in
Figure 2.4.
2.10 The CFG of the LLIR code in Figure 2.9.
2.11 A simple loop which computes the factorial of 10, and the resulting PEG.
This example and its PEG, showing computations that lead to the final x and
y, is taken from [Tat+].
2.12 A simple E-PEG example, taken from [Tat+].
2.13 An example APEG for the expression ((a + a) + b) × c, from [Mar12].
3.1 Our algorithm to compute clN f(), which discovers a set of equivalent
expressions with a ∪-distributive EEG f from an initial set of equivalent
expressions.
3.2 The algorithm used to compute fr(), i.e. the Pareto frontier from a set of
equivalent expressions.
3.3 The DFG for the sample expression (a + b) × (a + b).
3.4 The DFG for finding equivalent expressions of (a + b) × (a + b).
3.5 The alternative DFG for (a + b) × (a + b).
3.6 Optimization of (a + b)^2.
3.7 Simultaneous optimization of both e1 and e2.
3.8 Varying the mantissa width of Figure 3.7.
3.9 The Taylor expansion of sin(x + y).
3.10 The Motzkin polynomial em.
3.11 Accuracy of Area Estimation.
4.1 A simple program, basel, written with our syntax definition.
4.2 Two pairs of programs that are equivalent but syntactically different.
4.3 A simple program which exhibits common subexpressions reuse across nested
MIRs.
4.4 The accuracy analysis of a fixpoint expression.
4.5 The sharing of conditional expressions in a simple program.
4.6 The sharing of composition expressions.
4.7 The sharing of fixpoint expressions in a simple example program.
4.8 Example branch fusion creates a cycle.
4.9 A transformed while loop with different partial loop unroll depths.
4.10 The Pareto frontier.
4.11 The quality of resource estimation.
5.1 An overview of our automatic program optimization process. The shaded
region shows our internal tool flow.
5.2 An excerpt from the Seidel stencil [Pou]. The inter-iteration data-dependence
of the innermost loop is underlined (A[i][j] and A[i][j-1]).
5.3 The optimized program using only arithmetic equivalences.
5.4 The optimized program using arithmetic equivalences in tandem with
control-flow restructuring and memory access optimization.
5.5 A simple example program and its corresponding MIR.
5.6 An example to illustrate how MIRs are partitioned.
5.7 The algorithm used to sample the Pareto frontier.
5.8 A sample MIR transformation using the independent accesses rule.
5.9 The optimized program that computes the Fibonacci sequence. It reduces
latency of the original in Figure 5.5a by half and improves accuracy by 50%.
5.10 The MIR with edges labelled with latency attributes.
5.11 Comparisons of our estimated LUT counts against actual LUT counts from
VHLS.
5.12 Comparisons of our estimated latency statistics against actual latency from
VHLS.
5.13 Pareto-optimal variants of the Seidel stencil program from Figure 5.2. Each
graph shows a 2D projection of the 3D Pareto frontier. In each graph, the
original program is marked ×, and the lowest-latency variant obtained by
arithmetic transformations alone is marked by the red circle.
6.1 Kahan’s compensated summation algorithm to accurately compute the sum
of n elements $\sum_{i=0}^{n-1} x_i$.
C.1 simple
C.2 taylor
C.3 filter
C.4 euler
C.5 pid
C.6 sum
C.7 dotprod
C.8 tridiag
C.9 2mm
C.10 3mm
C.11 atax
C.12 bicg
C.13 gemm
C.14 gemver
C.15 mvt
C.16 seidel
C.17 syr2k
List of Tables
4.1 Table of optimization results.
5.1 Comparison among the optimized implementations generated by Vivado
high-level synthesis (VHLS)’s expression balancing and our optimizer. The
row “Total run time (s)” indicates the wall-clock time in seconds of running
the synthesized circuits.
5.2 Before-and-after examples to demonstrate the access reduction rules.
5.3 Comparisons of the original (non-shaded rows) and the optimized program
with lowest latency (shaded rows), for each benchmark. Values in parentheses
are obtained after slightly tweaking our experimental set-up; see
Section 5.6.2. We performed place-and-route for exact statistics.
List of Acronyms




ALM adaptive logic module




CNN convolutional neural network
CPU central processing unit
DAG directed acyclic graph
DFA data-flow analysis
DFG data-flow graph
DSA dynamic single assignment
DSP digital signal processing
E-PEG program expression graph with equivalent structures
EB expression balancing
EDA electronic design automation
EEG equivalent expression generator
FPGA field-programmable gate array
GCC GNU Compiler Collection
GMP GNU Multiple Precision Arithmetic Library
GPU general-purpose graphics processing unit





IIR infinite impulse response
ILP integer linear programming
IMS iterative modulo scheduling
IR intermediate representation
LAB logic array block
LFP least fixpoint
LI loop invariant
LLIR low-level virtual machine intermediate representation




MII minimum initiation interval
MIR metasemantic intermediate representation
MLAB memory logic array block
MPFR GNU Multiple Precision Floating-Point Reliable Library
MRT modulo reservation table
PEG program expression graph
PID proportional–integral–derivative
RAM random access memory
RAW read after write
RecMII recurrence-constrained minimum initiation interval
ResMII resource-constrained minimum initiation interval
RTL register-transfer level
SDC systems of difference constraints
SMT satisfiability modulo theory
SSA static single assignment
TFLOPS tera floating-point operations per second
VHLS Vivado high-level synthesis
VLIW very long instruction word
XST Xilinx Synthesis Technology
1 Introduction
As the computational capability of field-programmable gate array (FPGA) devices grows at
an exponential rate, numerical applications running on them become increasingly
complex. Traditional design methodologies, working at the register-transfer level (RTL),
have become increasingly costly, forcing us to work at a higher abstraction level [Gaj+92;
BDT10a; Mee+12], e.g. to implement our designs using high-level languages (HLLs) such
as C.
There are many reasons why FPGA implementations of numerical algorithms are now best
obtained via high-level synthesis (HLS) from C: less development effort, the abundance of
software engineers compared to hardware designers, the relative ease of testing C code
on an ordinary microprocessor, the opportunities for rapid design space exploration, and
so on [GR94; Mee+12; Nan+16]. Great advances have been made in this area recently,
and the output from HLS tools is nowadays competitive with hand-crafted designs for
certain types of applications [BDT10a].
Numerical C programs are typically written with floating-point arithmetic, following
the IEEE 754 standard for floating-point computation [ANS08]. Floating-point numbers
can represent a wide range of real values, and most programming languages support
the standard seamlessly. The standard has become ubiquitous, and is used in most of
our software and hardware implementations of numerical programs. Recently, Altera
has introduced Arria 10 and Stratix 10 devices to incorporate hardened floating-point
digital signal processing (DSP) blocks in the FPGA fabric [Alt15c]. As a result, we expect
to see floating-point arithmetic continuing to dominate in the FPGA implementation of
numerical applications.
Although we use floating-point arithmetic to implement algorithms that are generally
specified in real arithmetic, it is often neglected in practice that floating-point
computations almost always incur round-off errors, i.e. the discrepancy between the actual
result in real arithmetic and the rounded result computed with floating-point
arithmetic. Round-off errors, when accumulated, can have a devastating effect on numerical
accuracy [Hig02].
In fact, properties such as associativity, $(a + b) + c \equiv a + (b + c)$, and distributivity,
$a \times (b + c) \equiv a \times b + a \times c$, which we consider to be fundamental laws of real
numbers, no longer hold under floating-point arithmetic [Gol91]. For instance, under
single-precision floating-point arithmetic with rounding to the nearest, the result of
$(2^{-24} + 2^{-24}) + 1 = 1.00000012\ldots$ is exact, but $(1 + 2^{-24}) + 2^{-24}$ is rounded to 1.
Round-off errors in a numerical program are dependent on every arithmetic operation and
every input value, and with the impact on floating-point accuracy being so esoteric, it is
challenging for engineers to understand the repercussions of switching between
“(a + b) * c” and “a * c + b * c” in their programs.
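The single-precision example above can be reproduced with a short sketch. It is written in Python purely for illustration; the helper names `f32` and `add32` are our own, and binary32 rounding is emulated by a round trip through the standard `struct` module, assuming the platform converts with round-to-nearest, ties-to-even.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add32(a: float, b: float) -> float:
    """Emulated single-precision addition: add in binary64, round once to binary32."""
    return f32(f32(a) + f32(b))


tiny = 2.0 ** -24

small_first = add32(add32(tiny, tiny), 1.0)  # (2^-24 + 2^-24) + 1
large_first = add32(add32(1.0, tiny), tiny)  # (1 + 2^-24) + 2^-24

print(small_first == 1.0 + 2.0 ** -23)  # True: the result is exact
print(large_first == 1.0)               # True: both small terms are rounded away
```

The two orderings are equal in real arithmetic, yet only the first retains the contribution of the small terms.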
These numerical properties nevertheless open up the possibility of using the rules of real
arithmetic to generate a program that is equivalent to the original in real arithmetic, but
which could be of better quality than the original when evaluated in floating-point arithmetic.
Experienced engineers often apply such expression rewriting intuitions in numerical
programs. For instance, when summing a sequence of floating-point values, one can
sometimes reduce round-off error in the result by summing the inputs in ascending order
of magnitude. On the other hand, one can often reduce latency by applying expression
balancing, i.e. rearranging operators in an expression to construct a balanced tree, so
that more operators can work in parallel. These heuristics cover a very limited number
of possible transformations and may not always improve the original code. There is no
trivial procedure for applying sequences of equivalence-rule transformations that
optimally trades off latency, resources and numerical accuracy.
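To make the summation-order heuristic concrete, the following sketch sums one large value and 1024 small values in emulated single precision. The input data and the helper names `f32` and `sum32` are illustrative assumptions; the inputs are chosen so that the exact sum is representable, which is a favourable case rather than a general guarantee.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum32(values):
    """Left-to-right summation, rounding to binary32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


values = [1.0] + [2.0 ** -24] * 1024
exact = 1.0 + 2.0 ** -14  # the sum in real arithmetic

descending = sum32(sorted(values, key=abs, reverse=True))
ascending = sum32(sorted(values, key=abs))

print(descending == 1.0)   # True: every small term is absorbed by the large one
print(ascending == exact)  # True: small terms accumulate before meeting 1.0
```

Summing in ascending order of magnitude happens to be exact here, while the descending order loses all 1024 small terms.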
Existing HLS tools consider these rewrites to be unsafe [Xil12], and thus make very
limited use of them when restructuring floating-point data-paths. For instance, Vivado
high-level synthesis (VHLS) [Xil12] has only a very simple expression balancing feature
that uses associativity to improve latency, and only expressions with either additions or
multiplications are optimized. Moreover, it does not produce optimal loop pipelining,
because it does not take into account the implications of these transformations on inter-
iteration dependences and does not explore partial loop unrolling. In addition, VHLS
cannot reason about how this feature affects numerical accuracy; there is no guarantee
that this transformation will not result in a catastrophically inaccurate implementation.
In response, this thesis proposes new methodologies and an associated tool—SOAP, a
fully automatic source-to-source optimizer that augments VHLS—to optimize a given
numerical C program using these transformations in tandem with conventional program
transformations. The optimizer discovers not only one, but a wide spectrum of program
candidates. When synthesized in VHLS, these candidates trade off three performance
metrics of great importance to engineers: run time, resource usage and round-off error.
Here, run time refers to the latency in clock cycles, and resource usage refers to the number
of look-up tables (LUTs) and DSP elements. Some of these performance metrics could be
in conflict. For example, higher performance tends to require more circuitry, and how to
resolve this trade-off depends on the user’s requirements. As a result, the tool produces a
set of optimized programs, known as the Pareto frontier: those programs P for which the
tool has found no P′ that improves on P in all three metrics. We thus go beyond existing
literature by not only optimizing the precision requirements of an implementation, but
also changing the structure of the implementation itself.
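A Pareto frontier over the three metrics can be sketched as a simple filter. The code below is an illustration only, not SOAP's implementation: the `Candidate` type, the candidate names and all metric values are invented, and we assume the common dominance convention (no worse in every metric and different in at least one), which may differ in detail from the tool's.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    latency: int   # clock cycles, lower is better
    luts: int      # resource usage, lower is better
    error: float   # round-off error bound, lower is better


def dominates(p: Candidate, q: Candidate) -> bool:
    """True if p is no worse than q in every metric and differs in at least one."""
    a = (p.latency, p.luts, p.error)
    b = (q.latency, q.luts, q.error)
    return all(x <= y for x, y in zip(a, b)) and a != b


def pareto_frontier(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [q for q in candidates if not any(dominates(p, q) for p in candidates)]


# Hypothetical metric values, for illustration only.
candidates = [
    Candidate("original", 100, 500, 1e-5),
    Candidate("balanced", 60, 520, 2e-5),
    Candidate("accurate", 110, 510, 4e-6),
    Candidate("wasteful", 120, 600, 3e-5),
]

frontier = pareto_frontier(candidates)
print([c.name for c in frontier])  # ['original', 'balanced', 'accurate']
```

Only `wasteful` is discarded, because `original` is at least as good in all three metrics; the remaining candidates each win on some metric and so represent genuine trade-offs.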
The program optimization flow is safe and semantics-directed. Safety means that because
we base our method on formal mathematics to optimize programs, our approach can
be proved correct, in the sense that when executed using exact real arithmetic, the
transformed version produces exactly the same output values as the original program.
Semantics-directed transformation means that not only do we use program syntax, but
also the semantics, i.e. the underlying meaning of programs such as inferred numerical
accuracy, to guide optimization and guarantee safety properties of the optimized program.
When necessary, our technique analyzes the program to obtain a value bound and a round-off
error bound for each variable at every program location. This information is then used
to guide program optimization, by analyzing and manipulating not only the syntax, but
also the semantics of programs.
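As a rough illustration of what a value bound paired with a round-off error bound might look like, the sketch below propagates both through additions. It is not the analysis developed in this thesis: the `Bound` type and `add` function are hypothetical, the error model is the standard one, $fl(x + y) = (x + y)(1 + \delta)$ with $|\delta| \le 2^{-24}$, underflow and overflow are ignored, and the bounds themselves are computed in ordinary double precision without outward rounding.

```python
from dataclasses import dataclass

U = 2.0 ** -24  # unit round-off of binary32 under round-to-nearest


@dataclass(frozen=True)
class Bound:
    lo: float   # lower bound on the exact (real-arithmetic) value
    hi: float   # upper bound on the exact value
    err: float  # bound on |computed value - exact value|


def add(x: Bound, y: Bound) -> Bound:
    """Bound the exact sum and the round-off error of a floating-point addition."""
    lo, hi = x.lo + y.lo, x.hi + y.hi
    inherited = x.err + y.err                      # errors already in the operands
    magnitude = max(abs(lo), abs(hi)) + inherited  # bounds |computed x + computed y|
    return Bound(lo, hi, inherited + U * magnitude)


a = Bound(0.0, 1.0, 0.0)
b = Bound(0.0, 1.0, 0.0)
c = Bound(0.0, 1000.0, 0.0)

small_first = add(add(a, b), c)  # (a + b) + c
large_first = add(a, add(b, c))  # a + (b + c)

print(small_first.err < large_first.err)  # True
```

Both orderings have the same value range, [0, 1002], but the error bound of `(a + b) + c` is roughly half that of `a + (b + c)`, which is the kind of information an optimizer can use to rank equivalent structures.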
Generating candidate optimizations naïvely would however produce a combinatorial
explosion, even for small input programs. For instance, in the worst case, the parsing of a
simple summation of $n$ variables could result in
$(2n - 3)!! = 1 \times 3 \times 5 \times \cdots \times (2n - 3)$
distinct expressions modulo commutativity, i.e. we make use of associativity but ignore
any distinction caused by commutativity [IM12; Mou11]. This is further complicated by
distributivity, as the expression $(a + b)^k$ could expand into a summation
of $2^k$ terms, each with $k - 1$ multiplications. For this reason, it is usually infeasible
to generate a complete set of equivalent expressions using the rules of equivalence:
an expression with a moderate number of terms has a very large number
of equivalent expressions, and this number grows faster than exponentially with the
number of terms. New approaches were therefore developed specifically
to tackle the efficient discovery of equivalent structures in numerical programs. First, we
invent a new intermediate representation, called metasemantic intermediate representation
(MIR), to reduce the size of our search space without affecting its optimality. Second, we
further reduce the effort of exploring the new search space by intelligently pruning the
set of candidates as the search progresses up the input program’s abstract syntax tree.
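The $(2n - 3)!!$ count can be checked for small $n$ by brute force. The sketch below enumerates every way to parenthesize a sum of distinct variables while treating `x + y` and `y + x` as the same expression; the function names are our own, and the enumeration is purely illustrative rather than the discovery method used in this thesis.

```python
from itertools import combinations


def odd_double_factorial(n: int) -> int:
    """(2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3); equals 1 for n <= 2."""
    result = 1
    for k in range(1, 2 * n - 2, 2):
        result *= k
    return result


def parenthesizations(leaves):
    """All sums over distinct `leaves`, modulo commutativity of '+'."""
    leaves = tuple(leaves)
    if len(leaves) == 1:
        return [leaves[0]]
    first, rest = leaves[0], leaves[1:]
    results = []
    # Pinning `first` to the left operand counts each unordered split once;
    # the right operand must keep at least one leaf.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = (first,) + chosen
            right = tuple(x for x in rest if x not in chosen)
            for l in parenthesizations(left):
                for r in parenthesizations(right):
                    results.append(f"({l} + {r})")
    return results


for n in range(2, 6):
    names = "abcde"[:n]
    print(n, len(parenthesizations(names)), odd_double_factorial(n))
# prints: 2 1 1, then 3 3 3, then 4 15 15, then 5 105 105
```

Even with only five variables there are already 105 structurally distinct sums, before distributivity or any other rule is considered.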
All of the techniques above are designed with compositionality in mind. This means
that we recursively break down each component we work on, such as a program or an
MIR, into smaller components, use our analyses to calculate the results for each
(such as accuracy, area and latency, and subsequently equivalent candidates), and then
construct the final results from those of the subcomponents. When compared with a global
approach, there are two major advantages. First, because components are considered
independently of each other, once analyzed, their results can be reused in the analysis of
the larger enclosing components. Second, although we will not
formally prove the correctness of the methodologies in this thesis, they are designed to be
easily expressible in a formal language, and by proving small lemmas, we can deductively
prove the correctness of, for instance, a program-to-MIR translation.
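The compositional style can be illustrated with a toy analysis that computes a result for an expression solely from the results of its subexpressions, caching each so that it can be reused. Everything here is a simplifying assumption: expressions are nested tuples, operator count stands in for area (without modelling hardware sharing), and tree depth stands in for latency.

```python
from functools import lru_cache

# An expression is either a variable name (str) or a tuple (op, left, right).


@lru_cache(maxsize=None)
def analyze(expr):
    """Return (operator_count, depth), built only from the sub-results."""
    if isinstance(expr, str):
        return (0, 0)
    _, left, right = expr
    left_ops, left_depth = analyze(left)
    right_ops, right_depth = analyze(right)
    return (left_ops + right_ops + 1, max(left_depth, right_depth) + 1)


chain = ("+", ("+", ("+", "a", "b"), "c"), "d")     # ((a + b) + c) + d
balanced = ("+", ("+", "a", "b"), ("+", "c", "d"))  # (a + b) + (c + d)

print(analyze(chain))     # (3, 3)
print(analyze(balanced))  # (3, 2)
```

Both structures use three adders, but the balanced one has a shorter critical path, matching the expression balancing intuition; the cached result for `("+", "a", "b")` is computed once and reused by both analyses.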
1.1 Thesis Organization
Chapter 2 explains the essential concepts used throughout this thesis, and
discusses others’ work in depth to better motivate the thesis contributions. The techniques
discussed in this thesis naturally extend HLS, a process that compiles programs written in
HLLs such as C into circuits. We therefore start by introducing the advantages of HLS,
compared against manual RTL implementations. We then bring rigorous approaches from
software static analysis, specifically program semantics and abstract interpretation, to
source-to-source transformation for HLS. We introduce existing intermediate representations
(IRs) designed for program optimization, and highlight their advantages and disadvantages
with respect to our requirements. We further explain existing program rewriting
techniques for numerical accuracy, which inspired our efficient discovery of equivalent
programs. Finally, we introduce the concept of loop pipelining and how restructuring
numerical programs can optimize run time.
In Chapter 3 we propose new methods to automatically optimize the structure of arithmetic
expressions for FPGA implementation as part of an HLS flow. This chapter introduces
the basis of our approach to perform structural optimization on numerical programs,
taking into account axiomatic rules derived from real arithmetic, such as distributivity,
associativity and others. A new efficient method is proposed to generate a computable
optimized subset of equivalent expressions from an original expression. Our approach
explicitly targets an optimized area/accuracy trade-off, by automatically rewriting
arithmetic expressions, and analyzing each rewritten expression for its accuracy and area
usage. This gives the synthesis tool the flexibility to choose an implementation satisfying
constraints on both accuracy and resource usage. Using our technique to optimize the
structure of a variety of real-world and artificially generated examples in single precision,
we improve either their accuracy or the resource utilization by up to 60%.
Chapter 4 presents a similar source-to-source optimization targeting the trade-off between
numerical accuracy and resource usage, but extends it to optimize general numerical
programs, including if statements and while loops. Because there are an infinite number
of ways to rewrite numerical C programs, and many of these rewrites produce programs
that have the same resource usage, accuracy and latency properties, we introduce a novel
expression-based IR called MIR to reduce the number of rewrites to explore. In Chapter 4
we explain in detail the structure of MIRs, and the back-and-forth translation between
numerical C programs and MIRs. We efficiently discover equivalent structures in MIRs by
exploiting not only the rules of real arithmetic, such as associativity and distributivity, but
also rules that enable control-flow restructuring. Our numerical accuracy and resource
usage analyses are further extended to analyze MIRs. Additionally, we broaden the Pareto
frontier in our optimization flow to automatically explore the numerical implications of
partial loop unrolling and loop splitting. In real applications, the tool discovers a wide
range of Pareto optimal options, and the most accurate one improves the accuracy of
numerical programs by up to 65%.
The optimization techniques discussed so far have been limited to minimizing
area and round-off errors of numerical programs. Often such optimizations result
in programs with longer run time, yet there is potential for these transformations to
significantly reduce the run time of numerical programs while improving resources and
accuracy. In Chapter 5, we thus introduce a new analysis procedure to minimize yet an-
other objective: the total run time of the optimized program. Together with accuracy and
resource utilization, these three form the simultaneous goals we use to produce the Pareto
frontier. In this chapter, MIRs are extended with new operators to allow for arrays and
matrices in the source program, so that a wider range of practical numerical applications
can be optimized. Numerical programs typically spend most of their run time in loops,
hence state-of-the-art HLS tools use pipelining to schedule them efficiently. Still, the run
time performance of the resultant FPGA implementation is limited by data-dependences
between loop iterations. We thus present additional rewriting rules—memory access
reductions—along with arithmetic identities and control-flow transformations to alleviate
some of these dependence constraints. HLS tools cannot safely enable such rewrites
by default because they may impact the accuracy of floating-point computations and
increase area usage, whereas we optimize run time while controlling the implications on
accuracy and area. Again, the tool reports a multi-dimensional Pareto frontier that the
programmer can use to resolve the trade-off according to their needs. When applied to a
suite of PolyBench and Livermore Loops benchmarks, the tool generated programs that
enjoy up to a 12× speedup, with a simultaneous 7× increase in accuracy, at a cost of up
to 4× more LUTs.
Finally, Chapter 6 concludes this thesis by summarizing our research and discussing
how future research into the structural optimization of numerical programs
could further benefit the HLS community.
1.2 Contributions
In this section, we summarize the following original major contributions in this thesis:
• a new expression-based IR, metasemantic intermediate representation (MIR), devel-
oped based on the formal semantics of programs, to safely and considerably reduce
the size of the search space of equivalent programs [GC15];
• an efficient algorithm to discover equivalent expressions and MIRs through bottom-
up hierarchy, graph partitioning, and intelligent pruning of optimization candidates
by performing accuracy, area estimation and latency analysis [Gao+13];
• we bring together standard program equivalences that do not affect program
behavior (e.g. loop/branch splitting/merging, partial loop unrolling, rules that remove
extraneous array accesses), and non-standard transformation rules (e.g. arithmetic
rules), to be used in a novel way to significantly impact latency, resource usage and
accuracy of a numerical program [Gao+16];
• an accuracy analysis based on abstract interpretation to calculate bounds on the
output and their corresponding round-off errors of a given optimized expres-
sion [Gao+13] or program candidate [GC15];
• a scheduling analysis that estimates the latency and resource usage of a given
optimized candidate [Gao+16]; and
• incorporating the above-mentioned techniques, the first optimizer to automatically
and safely produce optimized programs (and subsequent RTL implementations with
VHLS) on the three-dimensional Pareto frontier of options that trade off run time,
accuracy, and area [Gao+16].
1.3 Publications
The original contributions in this thesis have led to the following peer-reviewed
publications:
• [Gao+13] Xitong Gao, Samuel Bayliss, and George A. Constantinides. “SOAP:
Structural Optimization of Arithmetic Expressions for High-Level Synthesis”. In Pro-
ceedings of the 2013 International Conference on Field-Programmable Technology
(FPT), pp. 112–119.
• [GC15] Xitong Gao and George A. Constantinides. “Numerical Program Optimization
for High-Level Synthesis”. In Proceedings of the 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, pp. 210–213.
• [Gao+16] Xitong Gao, John Wickerson, and George A. Constantinides. “Auto-
matically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level
Synthesis”. In Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pp. 234–243.
2 Background
The main focus of this thesis is to devise optimization methods to efficiently rewrite
numerical programs by automatically varying the structure of their control- and data-
paths for an improved trade-off between numerical accuracy, throughput and resource
utilization for HLS. This chapter therefore provides a review of recent developments in
related work and various concepts that closely relate to the foundation of this central
problem. This section serves to highlight the main research topic to be discussed in depth
in each section.
We introduce various computing architectures, and explain the pros and cons of each one
in Section 2.1. We then explain in detail how FPGAs can be used as an alternative to
general-purpose processors such as central processing units (CPUs) and general-purpose
graphics processing units (GPUs) to accelerate computationally intensive applications, by
drawing comparisons of architectural differences to processors. This is then followed by
an overview of how a traditional synthesis tool flow compiles RTL implementations into
circuits by taking Altera Quartus II [Alt10] as an example.
Because of the ever-increasing complexity of applications designed into circuits, low-level
approaches such as RTL design become intractable for humans [Gaj+92], and a gradual
adoption of HLS tools has been observed in the FPGA community in recent years. For this
reason, in Section 2.2, we review the current state and advantages of HLS, and describe
the stages employed by these tools to compile a program in a high-level language into
digital circuits. Because run time is one of our optimization objectives, we further
examine in greater detail how modulo system of difference constraints (SDC) scheduling
can make loops run as efficiently as possible by pipelining them. This scheduling
algorithm is used by HLS tools such as LegUp [LU].
The methods discussed in this thesis rely heavily on the static analysis of numerical
programs based on abstract interpretation [CC77]. In Section 2.3, we therefore review
abstract interpretation, a theory to create an abstraction for mathematical objects that
are not directly computable—because they are undecidable, intractable or both—in order
to compute an approximation of them efficiently. We informally introduce this theory, by
way of reviewing it with a progressive program analysis example.
IRs are often used to facilitate program optimization and program analysis. In Section 2.4,
we therefore study in detail the existing IRs designed for program optimization. We
take a closer look at how the static single assignment (SSA) and control-flow graph (CFG)
approach of low-level virtual machine (LLVM) [LLIR] encodes a simple program example.
As this technique limits itself to optimizing a single IR of a program, a novel approach
known as equality saturation [Tat+], which uses a data structure to simultaneously
represent multiple optimization candidates, is further examined.
In Section 2.5 we review prior work on restructuring arithmetic expressions and numerical
programs. We examine existing optimization passes in software compilers and HLS tools.
Since HLS tools are LLVM-based and compile C or C++ programs, they borrow the
concept from software compiler frameworks, such as LLVM, of including optimization
passes that may improve performance of floating-point computations, while potentially
sacrificing numerical accuracy. On the other hand, we look at methods that could
improve accuracy by simultaneously exploring multiple rewrites of expression candidates.
Because calculating the full set of equivalent programs is often computationally
intractable, these methodologies either make extensive use of heuristics, or abstract
interpretation techniques introduced in Section 2.3, in order to compute an under-
approximation (a potentially smaller set) of the set of all equivalent programs.
2.1 Field-Programmable Gate Arrays
When it comes to implementing computations, we often choose from a spectrum of
computing machines. These choices range from those with fixed architectures that
compute by executing software designs such as CPUs and GPUs, those that implement
custom hardware architectures, such as FPGAs, to those with custom integrated circuits
to carry out computations, i.e. application-specific integrated circuits (ASICs).
There has been a great amount of effort in recent decades to make fixed-architecture
machines run as fast as possible; many novel and intricate ideas were proposed and we
now have a variety of general circuits in CPUs to improve their performance [HP11]. For
instance, CPUs could have a pipeline that spans several clock cycles to fetch and decode
instructions, and access data from memory or registers to carry out computations. At the
same time, they could make predictions about the branches taken in control-flows, as the
pipeline must be flushed if an incorrect instruction is fetched, incurring a penalty in speed.
Superscalar architecture and out-of-order execution are used to increase instruction-level
parallelism. They may also exploit data-level and thread-level parallelism, in order to
maximize opportunities to parallelize computations. These are just the tip of the iceberg,
as many other architectural advancements exist.
The great majority of fixed-architecture computing machines are based on the von
Neumann architecture, which consists of three parts: a computation unit, a memory and
a bus between them to move data back and forth. Applications running on these
machines can often spend the majority of their time and energy moving data and instructions
between the memory and the right location in the processor as fast as possible, in order
to carry out arithmetic computations. Because computational tasks frequently reuse
input data and intermediate results, a hierarchy of caches, in tandem with cache-aware
compiler optimizations [KW03], are often used to mitigate the costs of exchanging data
between the processor and the memory. Despite these optimization efforts to run software
code as fast as possible, the processor-memory bus, which is often referred to as the
von Neumann bottleneck [Bac78], inherently exists and remains the limiting factor of
performance in the architecture. This phenomenon is known to many as hitting the
memory wall [Bac+13; WM94].
Custom architectures in general achieve much higher performance, thanks to their ability
to implement arbitrary digital circuits specifically designed for the application under
consideration, and spatially distribute memory bandwidth and computations. This is in
stark contrast to microprocessors such as CPUs and GPUs, which utilize general-purpose
circuits to cope with a wide range of applications. ASICs provide the best power efficiency
and performance among all of the above architectures; however, they are often associated with
long development cycles and high costs; any updates to the design would require a
complete and expensive re-spinning of the circuits [Bac+13], as they are inherently non-
programmable. FPGAs provide a good trade-off between processors and ASICs. Not only
do FPGAs have better performance and power characteristics than fixed architectures,
they also offer high programmability which makes FPGAs cost-effective low-volume ASIC
replacements [PB04; Bac+13]. At the same time, with a much shorter development
period than ASICs, a hardware design on FPGAs can be implemented at a much lower
cost. A shorter time to market further enables a substantially larger profit through a
competitively early market entry [Sem].
For the above reasons, being able to leverage parallelism from bit-level all the way to
the loop- and task-level, FPGAs have been increasingly used as high-performance and
low-power alternatives to CPUs and GPUs for many classes of applications [Bac+13;
Bro+10; SF08]. For example, Thomas et al. [Tho+09] reported that an FPGA-based random
number generator can obtain a 260× speed-up, while costing less than 1% of the energy to
produce each random sample, when compared to its software counterpart running on a
CPU. Microsoft initiated a mid-scale deployment of Stratix V FPGAs in their data center,
improving the throughput of their Bing web search engine by 95% [Put+14].
2.1.1 FPGA Architecture
FPGAs owe their high performance and power efficiency to the design of their architecture;
we thus use Altera Stratix V [Alt15d] as an example to explain the architecture. The
Stratix V fabric contains a two-dimensional array of logic array blocks (LABs). Each
LAB in turn consists of an array of 10 adaptive logic modules (ALMs). Figure 2.1 shows
a high-level block diagram of an ALM in Stratix V. In an ALM, multiplexers can be
configured to choose whether full adders and registers are used. Dedicated full adders
enable more complex Boolean functions to be implemented in a single ALM, whereas
the use of registers, which store intermediate values, determines whether the circuit is
combinational or sequential. The two LUTs in an ALM can be configured to compute a
combination of two arbitrary Boolean functions, each with up to 5 inputs from 8 inputs
in total. Stratix 10, slated to be released in the next couple of years, has up to 1.87
million ALMs and 7.47 million registers in total for the most demanding applications [Alt15b].
Interconnects, another class of key configurable resources on FPGAs, enable the inputs
and outputs of ALMs to be wired, in order to form larger and complete circuits from the



















Figure 2.1. A high-level block diagram of an ALM in Stratix V, from Stratix V Device Hand-
book [Alt15d].
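To make the role of the LUTs concrete, the following minimal sketch models a single LUT with up to 5 inputs as a truth table packed into a 32-bit word. The names lut_t, lut_eval and make_majority3 are hypothetical and introduced only for illustration. The sketch does not model the input sharing between the two LUTs of a Stratix V ALM, nor the full adders and registers shown in Figure 2.1.

```c
#include <assert.h>
#include <stdint.h>

/* A k-input LUT (k <= 5) modelled as a truth table packed into 32 bits:
 * bit i of `table` holds the output for the input combination i.
 * This is an illustrative software model, not the actual ALM circuit. */
typedef struct {
    uint32_t table;   /* truth table, one bit per input combination */
    unsigned inputs;  /* number of inputs, between 1 and 5 */
} lut_t;

/* Evaluate the LUT: the input bits select one entry of the truth table. */
unsigned lut_eval(lut_t lut, unsigned in)
{
    assert(lut.inputs >= 1u && lut.inputs <= 5u);
    assert(in < (1u << lut.inputs));
    return (unsigned)((lut.table >> in) & 1u);
}

/* "Configure" a LUT as a 3-input majority function by enumerating all
 * 8 input combinations and setting the corresponding truth-table bits. */
lut_t make_majority3(void)
{
    lut_t lut = { 0u, 3u };
    for (unsigned in = 0u; in < 8u; in++) {
        unsigned ones = (in & 1u) + ((in >> 1) & 1u) + ((in >> 2) & 1u);
        if (ones >= 2u)
            lut.table |= (uint32_t)1u << in;
    }
    return lut;
}
```

Configuring the LUT amounts to choosing the bits of table; any Boolean function of the given inputs can be expressed this way, which is what makes the fabric general-purpose.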
FPGAs with enough ALMs and interconnects can implement arbitrary digital designs. This
versatile architecture therefore overcomes the memory wall problem by not restricting
itself to the von Neumann architecture. As we have mentioned earlier, FPGAs can
implement a circuit that is individually tailored for the application; in contrast, CPUs have
general-purpose circuits designed for a wide range of applications, which may therefore
have lower power-efficiency and performance. Moreover, unlike the CPU which only has
a small set of registers, the FPGA with its flexibility and abundant registers, allows designs
to distribute memory blocks and computation units and place them in close proximity.
Traditionally, multipliers, when implemented as soft-logic in FPGAs, cost a large number
of ALMs. Stratix devices thus further include an array of hardened components to carry
out arithmetic operations distributed on the FPGA fabric, known as DSP blocks, or simply
DSPs. Because of the dedicated hardened circuits, DSPs compute faster than arithmetic
operators formed by ALMs only, while freeing ALM resources to perform more
non-arithmetic computations. In Stratix V, each DSP is paired with a LAB. These DSPs
are fracturable, i.e. they can be configured in combinations to perform a wide variety
of arithmetic operations, ranging from those using a single DSP element to synthesize
three multipliers each with two 9-bit inputs, up to those combining four DSPs to form a
complex-number multiplier with two 27-bit inputs [Alt15d]. Computations with larger
inputs can also be implemented by using ALMs and DSPs to form larger arithmetic
circuits. Finally, Stratix 10 will introduce hardened floating-point DSPs, enabling IEEE
754 [ANS08] single-precision floating-point additions and multiplications, achieving a
performance of up to 10 tera floating-point operations per second (TFLOPS) [Alt15c].
These DSP blocks can also be adapted to multiply fixed-point inputs.
DSPs accelerate arithmetic computations; however, to be fully utilized they need to be
supplied with inputs as fast as they can process them. In general, in most applications, data are
frequently reused by the same computation unit. Stratix V therefore includes dedicated
embedded memory called M20K blocks (20 Kb storage) to be arranged and combined
into dual-port random access memories (RAMs). Half of the LABs on the device, called
memory logic array blocks (MLABs), can also be configured to become 640-bit RAMs.
These memory blocks are distributed across the FPGA fabric, so that DSPs may find them
in proximity.
2.1.2 RTL Design Flow
Modern FPGAs—with up to several million LUTs, and thousands of embedded memory
and DSP blocks, wired through a programmable fabric of interconnects—are humanly
intractable to program at the granularity of these individual components [KD08]. FPGA
applications are thus commonly written in an RTL hardware description language (HDL),
such as Verilog [IEE06] and VHDL [IEE09]. These HDL source programs implement the
desired hardware by describing the logic between registers. Electronic design automation
(EDA) tools can then automatically translate these descriptions into hardware circuits in
FPGAs.
EDA tools go through several stages to synthesize HDL source code into circuits. To
explain these stages in depth, we take Altera Quartus II [Alt10] as an example design
flow, shown in Figure 2.2.






Figure 2.2. Quartus II design flow.
Quartus II starts its compilation of the RTL program by verifying source code for syntax
and semantic errors and design specification for inconsistencies, then applies a methodol-
ogy, called technology mapping, which maps a graph of device-independent logic gates
in logic expressions onto a network of functional blocks (such as LUTs, DSPs and mem-
ory blocks) in the target FPGA device [CP08]; this generated network is known as a
technology-mapped netlist. In this process, synthesis tools may optimize the circuit by
performing additional transformations such as redundant logic removal [Alt10].
The following stage, place & route, utilizes a heuristic placement algorithm, which takes
as its inputs the netlist, together with a device map showing the location of each of its
functional units, in order to select a legal location on the FPGA for each functional block in
the netlist, such that the routing of these blocks is optimized [Bet08]. In general, synthesis
tools allow some freedom in the user’s preference of the placement of the circuit. Additional
automated optimizations may be applied to improve performance. For example, Quartus
II has the option to enable register retiming [Alt10], which allows registers to move across
combinational logic to reduce critical path delay, i.e. the longest delay required for an
output of any source register to propagate to the input of any target register in the circuit.
The end result of this step is a circuit fully mapped on the target FPGA.
In the following step, timing analysis, the tool computes the longest delay of all critical
paths, which determines the maximum frequency at which the application can run. Users
can also inspect the list of critical paths and their delay statistics, so that one may focus
their effort on optimizing the timing of these critical paths by, for instance, splitting them
up by adding registers.
In the simulation stage, the resulting design is simulated using EDA simulation tools.
The final step, programming, translates the circuit generated by the tool into a
bitstream, which is a binary data file used to program the FPGA. Similar to processors,
which can be programmed by incrementally reading and executing instructions from an
executable program file, a bitstream is used to program the individual components such
as LUTs, DSP blocks, dedicated memory blocks and interconnects on an FPGA, so the
circuit is formed on the device. The difference between them is that while processors
continuously read instructions from memory, FPGAs are typically programmed only once
during the initial setup, and the bitstream data are used spatially to infer a circuit rather
than sequentially as instructions [Guc08].
2.2 High-Level Synthesis
High-level synthesis (HLS) is the process of compiling a high-level representation of an
application (usually in C, C++ or MATLAB) into an RTL implementation [CM08; Gaj+92].
In turn, this RTL design can be synthesized into a circuit and programmed onto the
FPGA device. With modern HLS tools, some applications are synthesized to have similar
performance when compared with hand-crafted RTL implementations [BDT10b].
The major advantage of HLS tools is that they enable us to work in an HLL, as opposed to
facing labour-intensive tasks such as optimizing timing and designing control logic in the
RTL design process. This allows application designers to focus instead on the algorithmic
and functional aspects of their implementation [CM08], without concerning themselves
with the intricate details of manual RTL designs.
Another advantage of using HLS tools is that they are in general more productive and
less error-prone to work with, when compared with traditional RTL tools. The reasons
are two-fold. Firstly, a C description is smaller than a traditional RTL description by a
factor of 10 [CM08; BDT10b]. Secondly, RTL design can be notoriously difficult to debug,
whereas C code can be easily tested on an ordinary microprocessor, and mature debug
and analysis tools for C are freely accessible [Can+13].
HLS tools further benefit us in their ability to automatically search the design space with
a reasonable design cost [BDT10b], potentially exploring a large number of trade-offs
between performance, cost and power [McF+90]. Traditionally, this is generally much
more difficult to achieve in RTL designs because of their low-level nature.
With recent advancements in this area, HLS tools have received a resurgence of interest.
Many commercial tools have been released, such as Catapult High-Level Synthesis [MG],
Impulse C [Imp] and PICO [Sch+02], to meet the burgeoning demand from the FPGA
community. Xilinx incorporates a sophisticated HLS flow for C/C++ named VHLS into
its Vivado design suite [Xil15], and their SDAccel Development Environment [Wir14] for
C/C++/OpenCL allows data centers to leverage FPGAs. Altera’s HLS solution is their
Altera SDK for OpenCL [Alt15a], which accelerates OpenCL applications on their FPGA
devices. Besides commercial tools, many open-source HLS tools have also been released
in recent years, such as ROCCC [Naj07], Trident [Tri+05] and LegUp [LU]. LegUp is now
gaining significant traction in the research community.
2.2.1 HLS Design Flow
This section provides an overview of the stages taken by the HLS tool to compile a
C program into RTL implementation, by using LegUp [LU; Can+13] as our example.
LegUp is an HLS tool which compiles programs to run on a hybrid software/hardware
architecture, and its design flow is shown in Figure 2.3, which consists of three major
stages to be explained below.
The first stage is to determine which parts of an application, at the function level, are
suitable candidates to be synthesized into hardware circuits, while the rest can be run
on a soft-processor. This stage starts by compiling a C source program into a software























Figure 2.3. The LegUp design flow, adapted from [Can+13] and [LU].
executable targeting an FPGA-based MIPS processor. This processor has additional circuits
designed to profile the software implementation of the original application. By running
the compiled application on this processor, this profiling ability allows the processor to use
statistics such as number of clock cycles, power and cache misses to identify problematic
parts of the program at the function level that will benefit from a hardware redesign, so
that the power efficiency and run time could be improved [Can+13].
After identifying functions of the application to be implemented as part of a hardware
architecture, the next stage is then to synthesize hardware designs from these functions.
LegUp’s synthesis toolchain is based on the LLVM compiler infrastructure [LA], and it
synthesizes C functions into circuits in a series of steps.
It starts by using the LLVM front-end to compile a C function into low-level virtual machine
intermediate representation (LLIR), a platform-independent IR that is capable of cleanly
representing HLLs [LLIR]. Conventional and HLS-focused compiler optimization passes
are used to transform the IR program, such that the result when synthesized will have
better performance when running on the FPGA.
This is then followed by the HLS tool flow, which consists of four logical steps: allocation,
scheduling, binding and RTL generation. The first step, allocation, extracts information
from the application and user requirements to be used in subsequent stages, e.g. mod-
ules and RAM blocks to be synthesized on the target device. This is then followed by
scheduling, which assigns the start and end states to each LLVM instruction in a finite
state machine [LU], using a scheduling algorithm based on the formulation of SDC [LU;
Can+13; CZ06]. Many applications spend most of their time in loops; a scheduling
technique known as loop pipelining is therefore used in HLS tools to make them run efficiently.
This technique admits greater parallelism by allowing instructions in consecutive loop
iterations to overlap as much as possible. LegUp uses modulo SDC scheduling [Can+14],
which we will cover in Section 2.2.3, to minimize the wall-clock time of a pipelined loop.
The third logical step, binding, assigns each operator in the program to functional units
to be synthesized into hardware, and maps program variables to registers. The rationale
behind this step is that operators such as multipliers and dividers that tend to use a lot of
LUTs and DSP blocks can be shared temporally. Sharing these functional units requires
multiplexers, which are relatively expensive to implement on FPGAs. Each assignment of an
operator to a functional unit is thus associated with a cost. The problem of minimizing
this cost is called the assignment problem, which is efficiently solved in LegUp with a
polynomial time complexity using the Hungarian algorithm [Can+13; Kuh10]. Finally,
the RTL generation step gathers information produced from the previous three steps, to
generate Verilog source code corresponding to the C function being compiled.
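To illustrate what the assignment problem in the binding step asks for, the sketch below finds the cheapest one-to-one assignment of three operators to three functional units by exhaustive search. This is deliberately not the Hungarian algorithm used by LegUp, which reaches the same optimum in polynomial time. The cost values used in the example are made up, and the names N_OPS, search and min_assignment_cost are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define N_OPS 3  /* operators and functional units in this toy instance */

/* Recursively try every free functional unit for operator `op` and
 * record the cheapest complete one-to-one assignment in *best.
 * `used_mask` has bit u set when unit u is already taken. */
void search(int cost[N_OPS][N_OPS], int op, unsigned used_mask,
            int partial, int *best)
{
    if (op == N_OPS) {
        if (partial < *best)
            *best = partial;
        return;
    }
    for (int unit = 0; unit < N_OPS; unit++) {
        if (used_mask & (1u << unit))
            continue;  /* unit already bound to an earlier operator */
        search(cost, op + 1, used_mask | (1u << unit),
               partial + cost[op][unit], best);
    }
}

/* cost[i][j] is the (illustrative) cost of binding operator i to unit j.
 * Exhaustive search visits all N_OPS! assignments, so it only suits
 * very small instances. */
int min_assignment_cost(int cost[N_OPS][N_OPS])
{
    int best = INT_MAX;
    assert(N_OPS >= 1);
    search(cost, 0, 0u, 0, &best);
    return best;
}
```

For the made-up cost matrix {{4, 1, 3}, {2, 0, 5}, {3, 2, 2}}, the cheapest assignment binds operator 0 to unit 1, operator 1 to unit 0 and operator 2 to unit 2, for a total cost of 5.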
The third, and also the final stage, is to integrate software and hardware components of
the application into the FPGA device. Following [Can+13], we explain it as follows. Firstly,
custom accelerator circuits generated by HLS, a MIPS processor, and communication
interfaces between them are synthesized and programmed into the FPGA device. Because
some of the functions in the original C source code were implemented as hardware
accelerators in the HLS compilation flow, LegUp replaces them with wrapper functions
which can invoke the hardware accelerators at runtime. Finally, this modified source code
can then be compiled into a MIPS binary to be executed on the FPGA.
2.2.2 Loop Pipelining
Loop parallelism, and consequently program run time, is one of the main
objectives we optimize later in Chapter 5. Hence in this subsection we first introduce the
concept of loop pipelining. We consider our example program dotprod in Figure 2.4,
which computes the dot-product, d, of two arrays A and B of floating-point values.
We assume both arrays are stored in the same RAM, which has one read port, and
accessing this RAM has a one cycle latency. We further assume that there is no limit on the
number of arithmetic operators that can be allocated, and that floating-point multipliers and
adders are both fully pipelined, using 7 and 10 cycles respectively to produce outputs.
#define N 1024
float dotprod(float A[N], float B[N]) {
    float d = 0.0f;
    for (int i = 0; i < N; i++) {
        d = d + A[i] * B[i];
    }
    return d;
}
Figure 2.4. A simple dot-product example which calculates the dot-product of two arrays A and
B, each with 1024 elements.
A trivial way to schedule the loop in dotprod is to allow each iteration to complete
before starting the next iteration; this is however not very efficient. As we can see in
Figure 2.5, with a good schedule, operations across loop iterations can often temporally
overlap, enabling parallelism and improving the performance of the loop execution. In
Figure 2.5, iterations are laid out in rows, each clock cycle is a column, mul and add
are multiplication and addition respectively, A[0] and B[0] are read accesses from the
two arrays, and the arrows indicate the data-flow of d across iterations. This schedule
allows consecutive iterations to start every 10 cycles; and this number of clock cycles
between the start of consecutive iterations is known as the initiation interval (II). Loop
iterations in this schedule repeat 1024 times (the trip count, N), and each iteration
requires 20 cycles (the depth, D, of the loop). As a result, the overall latency L of this loop
is (N − 1) × II + D = (1024 − 1) × 10 + 20 = 10,250 cycles.
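The latency formula can be checked with a few lines of C. The sketch below evaluates L = (N − 1) × II + D, and, for comparison, the latency of the trivial schedule in which each iteration completes before the next one starts. The function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>

/* Overall latency, in cycles, of a pipelined loop:
 * L = (N - 1) * II + D, with trip count N, initiation interval II
 * and depth D of a single iteration. */
long pipelined_latency(long trip_count, long ii, long depth)
{
    assert(trip_count >= 1 && ii >= 1 && depth >= 1);
    return (trip_count - 1) * ii + depth;
}

/* Latency of the trivial schedule without overlap between iterations.
 * This equals the pipelined latency with II = D. */
long sequential_latency(long trip_count, long depth)
{
    assert(trip_count >= 1 && depth >= 1);
    return trip_count * depth;
}
```

With N = 1024, II = 10 and D = 20, pipelined_latency returns 10250 cycles, whereas sequential_latency returns 20480 cycles, so overlapping iterations roughly halves the latency of this loop.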













Figure 2.5. The resulting schedule of the example program in Figure 2.4.
Any valid schedule of this loop must satisfy the constraints imposed by data-dependences.
For instance in our example, it is clear that in a single iteration, in the loop body,
multiplication of A[i] and B[i] must precede addition of d and the multiplied result.
Furthermore, in the (i + 1)-th iteration, access to the variable d must wait until d is
updated with a new value in the i-th iteration; data-dependences therefore also exist on d
across loop iterations. We call the former kind of dependences intra-iteration dependences
and the latter inter-iteration dependences.
Besides data-dependence constraints, the number of resources available also affects
loop scheduling. For instance, under our assumption in dotprod the RAM can only
be read once per cycle; our schedule thus should avoid reading from the same RAM more
than once in the same clock cycle. We say a schedule is optimal if the overall latency
L = (N − 1) × II + D is minimized while none of the constraints are violated. However,
with a much more complex program, finding the optimal schedule is often an intractable
task. Limits on resource availability, along with dependence constraints, make scheduling
an NP-hard problem which is difficult to solve optimally and efficiently [Hwa+91]. In
the following part of this section, we discuss modulo SDC scheduling [ZL13; Can+14]
used in HLS tools, such as LegUp, to efficiently attack the scheduling problem.
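Although finding an optimal schedule is hard in general, the two kinds of constraints above each give a simple lower bound on the II. The sketch below computes the resource bound and the recurrence bound that are standard in the modulo scheduling literature [Rau94]; they are not stated explicitly in the text above, and the function names are hypothetical.

```c
#include <assert.h>

/* Ceiling of a / b for a >= 0 and b > 0. */
int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Resource bound: if one iteration needs `uses` accesses to a resource
 * that serves `ports` accesses per cycle, consecutive iterations must
 * start at least ceil(uses / ports) cycles apart. */
int res_mii(int uses, int ports)
{
    return ceil_div(uses, ports);
}

/* Recurrence bound for one dependence cycle: the total latency around
 * the cycle divided by the number of iterations it spans. */
int rec_mii(int cycle_latency, int cycle_distance)
{
    return ceil_div(cycle_latency, cycle_distance);
}

/* No valid schedule can have an II below the larger of the two bounds. */
int min_ii(int res_bound, int rec_bound)
{
    return res_bound > rec_bound ? res_bound : rec_bound;
}
```

For dotprod, the two array reads share one read port, giving a resource bound of 2, and the 10-cycle adder feeds its own input in the next iteration, giving a recurrence bound of 10. The larger of the two, 10, agrees with the II of the schedule in Figure 2.5.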
2.2.3 Modulo SDC Scheduling
Many algorithms exist to schedule pipelined loops. For example, Fan et al. [Fan+08]
proposed that a schedule can be found by formulating the constraints into a satisfiability
modulo theories (SMT) problem and using an SMT solver to modulo schedule operations.
An alternative technique, iterative modulo scheduling (IMS) [Rau94], has been widely
adopted by compilers that use software pipelining to schedule instructions for very
long instruction word (VLIW) processors [MS03]. This method has also been widely
adopted in HLS tools such as PICO-NPA [Sch+02], Trident [Tri+05] and LegUp [Can+13;
Can+14]. IMS for software pipelining [Rau94], however, did not consider operator
chaining, i.e. allowing operations with combinational logic in a data-flow sequence to be
carried out in the same clock cycle. Schreiber et al. [Sch+02] in their adoption of IMS in
HLS, found operator chaining to be non-trivial in IMS and requires static timing analysis
of combinational components [Can+14]. A new method, the modulo SDC scheduling
algorithm, has thus recently gained traction and has been used by Caniset al. [Can+14]
in their LegUp and by Zhang et al. in [ZL13], because an SDC formulation is more suited
to model the effect of chaining operators. For this reason, an overview of the modulo
SDC scheduling approach is provided in this subsection.
Constructing the Data-Dependence Graph
In the first stage of modulo SDC scheduling, dependence relations are extracted from
the body of this loop. These dependence relations form a dependence graph, where
vertices are operations, and edges between pairs of vertices indicate dependence relations.
This dependence graph can subsequently be used to derive data-dependence constraints.
Figure 2.6 shows the complete dependence graph of dotprod’s loop body.
Intra-iteration dependence edges in the graph are each labelled with 〈l, 0〉, a pair of
integer attributes. The first integer l signifies the number of clock cycles
that must elapse between the start of the predecessor and the successor operations. The
second value 0 indicates that the dependence occurs within the same iteration. To illustrate,
the edge between the multiplier “×” and adder “+” has an attribute 〈7, 0〉, because “×”
takes 7 cycles to generate an output.
Figure 2.6. The dependence graph formed by the data-dependences in the loop body of the
dot-product example in Figure 2.4. The dashed edge highlights the inter-iteration
dependence.
Additionally, inter-iteration dependences create elementary cycles in the dependence
graph. For example, in each iteration the initial value of d depends on the final value of
d from the previous iteration. In this graph we thus add an edge from the output of the
addition to the variable d. We say that this dependence has a dependence
distance of 1, as one iteration elapses between each update of the value
and its corresponding use. This edge is then assigned an attribute 〈10, 1〉, which signifies
that the adder has a latency of 10 cycles and that this dependence has a distance of 1.
Finding the Minimum Initiation Interval
Modulo SDC scheduling owes its efficiency to assuming an initial constant II and
attempting to search for a schedule that satisfies all constraints. This search stops if a
feasible schedule is found, otherwise II can be incremented by 1 and the search is
repeated until we discover a valid schedule. To begin, we can find a lower bound on II,
which we call the minimum initiation interval (MII), as our initial constant II. The MII
is computed such that all schedules with an II less than MII violate some constraints.
For each of the inter-iteration dependences and resource constraints, an MII can be
found, respectively known as recurrence-constrained minimum initiation interval (RecMII)
and resource-constrained minimum initiation interval (ResMII) [Rau94; Can+14; ZL13].
We first introduce methods to compute both values; the overall MII can then be
computed using the following equation:
MII = max (RecMII,ResMII) . (2.1)
Firstly, we compute ResMII by finding the most constrained resources in the loop, as
these limits determine the ResMII value. For example, dotprod does not impose limits
on the number of floating-point operators that can be allocated, but assumes there is a
constraint which restricts the rate of memory accesses, i.e. only one read is allowed in
each clock cycle. Because each iteration requires two accesses to the same memory, a
ResMII = 2 will thus fully occupy the RAM throughput. To generalize this to all loops,
we consider, for each type of operation ⊗ used in the loop, the number of available
resources r⊗ for ⊗ and the number of occurrences n⊗(G) of ⊗ in the loop dependence
graph G:
$$\mathrm{ResMII} = \max_{\otimes \in \mathrm{OpTypes}} \left\lceil \frac{n_\otimes(G)}{r_\otimes} \right\rceil, \tag{2.2}$$
where OpTypes is the set of all types of operations, e.g. array accesses, arithmetic units,
and others, used in the loop.
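As a minimal sketch of this step, the snippet below takes, over all constrained operation types, the ceiling of the number of occurrences divided by the number of available units, which is consistent with the dotprod example above. The function name `res_mii`, the dictionary encoding and the operation labels are assumptions made for illustration, not part of any HLS tool.

```python
def res_mii(occurrences, available):
    """Resource-constrained lower bound on II (illustrative sketch).

    occurrences: dict mapping an operation type to its number of uses
                 in the loop body (n_op).
    available:   dict mapping an operation type to the number of units
                 usable per cycle (r_op); a type that is absent is
                 treated as unconstrained.
    """
    bound = 1  # an II below 1 is meaningless
    for op, n in occurrences.items():
        r = available.get(op)
        if r is None:  # unconstrained resource imposes no bound
            continue
        bound = max(bound, (n + r - 1) // r)  # integer ceiling of n / r
    return bound


# dotprod: two reads per iteration from a RAM that allows one read per cycle.
# The labels "mem_read", "fmul" and "fadd" are hypothetical.
dotprod_ops = {"mem_read": 2, "fmul": 1, "fadd": 1}
dotprod_limits = {"mem_read": 1}
print(res_mii(dotprod_ops, dotprod_limits))
```

Treating absent entries as unconstrained mirrors the assumption that dotprod places no limit on floating-point operators.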
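The second step is to evaluate a minimal RecMII by ensuring that for all cycles c in the graph
G the following inequality holds:
$$\forall c \in \mathrm{Cycles}(G): \quad \sum_{e \in c} \mathrm{lat}(e) - \mathrm{II} \times \sum_{e \in c} \mathrm{dist}(e) \le 0, \tag{2.3}$$
where Cycles(G) computes the set of all cycles in the graph G; e ∈ c enumerates all edges
in the cycle c; and for an edge e between two vertices v1 and v2 of the form v1 −〈l,d〉→ v2,
lat(e) and dist(e) respectively evaluate to the latency l and dependence distance d. Hence,
∑_{e∈c} lat(e) and ∑_{e∈c} dist(e) respectively sum the latencies and dependence distances
along the cycle c. RecMII is then the smallest integer II that satisfies (2.3) for every cycle:
$$\mathrm{RecMII} = \max_{c \in \mathrm{Cycles}(G)} \left\lceil \frac{\sum_{e \in c} \mathrm{lat}(e)}{\sum_{e \in c} \mathrm{dist}(e)} \right\rceil. \tag{2.4}$$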
For example, in the dependence graph (Figure 2.6) of our simple program (Figure 2.4),
one cycle, d → + → d, exists and the sums of latencies and dependence distances along this
cycle are 10 and 1 respectively, thus the RecMII of this graph is ⌈10/1⌉ = 10.
The simplest possible method of finding RecMII is therefore to enumerate all cycles in
the graph and compute the ratio between the sums of latencies and dependence distances.
Unfortunately, in the worst case the number of cycles is exponential in the number of
edges in a graph, so this approach could become intractable for large loops. An alternative
method based on the Floyd-Warshall shortest path algorithm [Flo62], which runs in
polynomial time, is thus proposed in [Rau94] to efficiently find RecMII.
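The naive enumeration method can be sketched as follows. This is only the simple exponential approach described above, not the Floyd-Warshall based method of [Rau94]. The edge encoding `(src, dst, latency, distance)`, the node names, and the load latency of 2 cycles are assumptions for illustration; only the 〈7, 0〉 and 〈10, 1〉 attributes come from the text.

```python
def elementary_cycles(edges):
    """Yield each elementary cycle once, as a list of edges.

    edges: iterable of (src, dst, latency, distance) tuples.
    The number of cycles can be exponential, so use on small graphs only.
    """
    edges = list(edges)
    nodes = sorted({n for src, dst, _, _ in edges for n in (src, dst)})
    rank = {n: i for i, n in enumerate(nodes)}
    succ = {n: [] for n in nodes}
    for e in edges:
        succ[e[0]].append(e)

    def walk(start, node, path, seen):
        for e in succ[node]:
            nxt = e[1]
            if nxt == start:
                yield path + [e]
            # Only visit nodes ranked after the start node, so that every
            # cycle is reported exactly once, rooted at its smallest node.
            elif rank[nxt] > rank[start] and nxt not in seen:
                yield from walk(start, nxt, path + [e], seen | {nxt})

    for start in nodes:
        yield from walk(start, start, [], {start})


def rec_mii(edges):
    """Largest ceil(sum of latencies / sum of distances) over all cycles."""
    bound = 1
    for cycle in elementary_cycles(edges):
        lat = sum(e[2] for e in cycle)
        dist = sum(e[3] for e in cycle)
        if dist == 0:
            if lat > 0:
                raise ValueError("cycle with zero distance cannot be scheduled")
            continue
        bound = max(bound, -(-lat // dist))  # integer ceiling
    return bound


# Dependence graph of dotprod (Figure 2.6); load latencies are assumed.
DOTPROD = [
    ("A[i]", "mul", 2, 0),
    ("B[i]", "mul", 2, 0),
    ("mul", "add", 7, 0),
    ("d", "add", 0, 0),
    ("add", "d", 10, 1),
]
print(rec_mii(DOTPROD))
```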
Scheduling Operations
After assuming a tentative constant II, modulo SDC scheduling constructs
an SDC problem in order to solve for the schedule. We aim to assign each operation,
corresponding to a vertex v in the graph, to a time slot sv when it begins its operation.
In this process, the scheduler must ensure that no assignment violates
data-dependence and resource constraints. For instance, if a multiply operation is allocated
a time slot in the second clock cycle, then in each iteration it will start computation in the
second clock cycle of that iteration.
To begin, we ignore the resource limits for now and formulate an SDC problem for the
dependence constraints. For each dependence edge u −〈l,d〉→ v from vertex u to vertex v
in the graph, it is possible to write down the following inequality, where su and sv are
respectively the time slots for vertices u and v:
su − sv ≤ II × d − l. (2.5)
For instance, the edge × −〈7,0〉→ + is an intra-iteration dependence, hence II does not
constrain the scheduling relation between these two operations: substituting d
with 0 and l with 7, the II term vanishes and we derive (2.6) below. Similarly, the back-edge
+ −〈10,1〉→ d produces the inequality in (2.7):
s× − s+ ≤ −7, (2.6)
s+ − sd ≤ II − 10. (2.7)
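The translation from dependence edges to difference constraints is mechanical, as the sketch below suggests. Each edge u −〈l,d〉→ v becomes a triple `(u, v, bound)` read as s_u − s_v ≤ bound, following (2.5). The triple encoding and the names `mul`, `add` and `d` are illustrative assumptions.

```python
def dependence_constraints(edges, ii):
    """Map each edge (u, v, latency, distance) to (u, v, ii * distance - latency).

    A resulting triple (u, v, bound) stands for the inequality s_u - s_v <= bound.
    """
    return [(u, v, ii * d - l) for u, v, l, d in edges]


# The two edges discussed in the text: mul -<7,0>-> add and add -<10,1>-> d.
edges = [("mul", "add", 7, 0), ("add", "d", 10, 1)]
for u, v, bound in dependence_constraints(edges, ii=10):
    print(f"s_{u} - s_{v} <= {bound}")
```

With II = 10 the bound of the back-edge constraint is II − 10 = 0, while the intra-iteration constraint is independent of II.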
Besides dependence constraints, additional constraints are used to limit the length of
critical paths in combinational logic, such that, for instance, a long chain of additions
can be broken down into multiple cycles to guarantee the frequency requirement. For
all paths u → · · · → v between inter-iteration dependent vertices u and v consisting of
only combinational logic, it is possible to estimate a critical path delay delay(u, v). This
critical path delay is defined as the largest sum of propagation delays of the intermediate
operations along any combinational path from u to v. For a pair of dependent vertices u
and v such that delay(u, v) > Tclk, where Tclk = 1/fclk and fclk is the target clock frequency,
we can create the following constraint [CZ06]:
$$s_u - s_v \le -\left( \left\lceil \frac{\mathrm{delay}(u, v)}{T_{\mathrm{clk}}} \right\rceil - 1 \right). \tag{2.8}$$
This inequality ensures that for any critical path delay between u and v greater than
Tclk, the two operations are scheduled at least ⌈delay(u, v)/Tclk⌉ − 1 clock cycles apart.
Besides dependence and frequency constraints, Zhang et al. [ZL13] further introduce
optional ones, such as lifetime constraints, which aim to minimize the register requirements,
and relative timing constraints, which can be used to satisfy the timing requirement of
user-specified I/O protocols.
The rationale for using the SDC formulation to model constraints is that the feasibility
of an SDC system and the corresponding solution, if it exists, can be found efficiently.
More precisely, the Bellman-Ford algorithm [Sch05] solves it in Θ(ln) time,
where l is the number of constraint inequalities and n is the number of variables [ZL13].
Additionally, an SDC problem can be solved incrementally, i.e. a new solution, if it exists,
can be updated in O(m + n log n) time when a constraint is added or removed, by
using the algorithm presented by Ramalingam et al. [Ram+99]. In contrast, traditional
integer linear programming (ILP) scheduling techniques make use of O(mn) variables
to represent a scheduling problem, where m is the number of time slots and n is the
number of operations [Hwa+91], and solving this ILP problem often demands expensive
branch-and-bound procedures [ZL13] as ILP is NP-complete [Kar10].
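To make the feasibility check concrete, the following is a minimal Bellman-Ford style sketch for a system of difference constraints. It is the textbook non-incremental formulation, not the incremental algorithm of [Ram+99]. A constraint s_u − s_v ≤ b is encoded as the triple `(u, v, b)`; this encoding, the function name `solve_sdc`, and the zero-latency constraint between d and the adder are assumptions for illustration.

```python
def solve_sdc(variables, constraints):
    """Return a schedule satisfying every s_u - s_v <= b, or None if infeasible.

    variables:   non-empty list of variable names.
    constraints: list of (u, v, b) triples, each read as s_u - s_v <= b.
    """
    # Starting every value at 0 is equivalent to adding a virtual source
    # with a zero-weight edge to each variable.
    slot = {x: 0 for x in variables}
    for _ in range(len(variables)):
        changed = False
        for u, v, b in constraints:
            if slot[v] + b < slot[u]:  # relax the edge v -> u of weight b
                slot[u] = slot[v] + b
                changed = True
        if not changed:
            break
    else:
        return None  # still relaxing after |V| passes: a negative cycle exists
    earliest = min(slot.values())
    return {x: s - earliest for x, s in slot.items()}  # shift so time starts at 0


def dotprod_constraints(ii):
    return [
        ("mul", "add", -7),      # (2.6)
        ("add", "d", ii - 10),   # (2.7)
        ("d", "add", 0),         # assumed zero-latency read of d by the adder
    ]


print(solve_sdc(["mul", "add", "d"], dotprod_constraints(10)))
print(solve_sdc(["mul", "add", "d"], dotprod_constraints(9)))
```

In this sketch the system is feasible at II = 10 and infeasible at II = 9, which agrees with the RecMII of 10 derived earlier.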
Unfortunately, resource constraints, because of their non-linearity, cannot be easily
expressed as SDC constraints. Therefore, a data structure called the modulo reservation
table (MRT) is used to keep track of resource constraints as the loop is incrementally
scheduled [Can+14]. The MRT has II columns and each row tracks an available resource.
When a certain resource is used in the time slot su, the MRT records an entry for this
resource in column su mod II and the corresponding row of this resource. To illustrate,
consider our example in Figure 2.4, which assumes a single read from the memory per
cycle for both arrays. The MRT thus has II = 10 columns, and 1 row for accessing the
memory. If A[i] is assigned time slot 0, then a schedule assigning B[i] to the
same time slot must be invalid, because a record exists for A[i] in row 1, column 0;
B[i], which is competing for the same resource, must hence be scheduled in a later
time slot.
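A modulo reservation table can be sketched as a small class with one row per resource and II columns. The class name, method names and the resource label `ram_port` below are hypothetical and chosen only to mirror the description above.

```python
class ModuloReservationTable:
    """One row per resource, one column per time slot modulo II."""

    def __init__(self, ii, resources):
        self.ii = ii
        self.rows = {r: [None] * ii for r in resources}

    def conflicts(self, resource, slot):
        """True if the resource is already taken in column slot mod II."""
        return self.rows[resource][slot % self.ii] is not None

    def reserve(self, resource, slot, op):
        """Record that op uses the resource at the given time slot."""
        if self.conflicts(resource, slot):
            raise ValueError(f"{resource} is busy in column {slot % self.ii}")
        self.rows[resource][slot % self.ii] = op


# dotprod with II = 10 and a single memory read port shared by A[i] and B[i].
mrt = ModuloReservationTable(10, ["ram_port"])
mrt.reserve("ram_port", 0, "A[i]")
print(mrt.conflicts("ram_port", 0))   # B[i] cannot also read in slot 0
print(mrt.conflicts("ram_port", 1))   # but slot 1 is still free
print(mrt.conflicts("ram_port", 10))  # slot 10 wraps around to column 0
```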
A typical modulo SDC scheduling algorithm begins with a schedule without resource
constraints. A priority function is then used to sort all resource-constrained operations by
perturbation, i.e. we place greater importance on operations that have a larger impact on
the schedule when they are moved [Can+14]. For each operation u in this sorted list, if u
is currently scheduled at time slot tu and does not have a resource conflict in the MRT, a
new constraint su = tu is constructed; otherwise a different constraint su ≥ tu + 1 is used
to ensure the operation u is scheduled at least one cycle later, so that it does not compete
for the resource in time slot tu. This newly created constraint is then tentatively added
to the SDC problem. If the resulting SDC problem is feasible, a new solution can be found
incrementally; otherwise the algorithm backtracks to the latest feasible SDC formulation
and tries to schedule other operations before u. A time budget can also be used to limit the
number of attempts to schedule resource-constrained operations; if a valid schedule cannot
be found under the given budget, the II can be incremented by 1 to relax the dependence
and resource constraints, and the above-mentioned procedure can be repeated until a
valid schedule is found.
2.2.4 Obstacles in Adoption
High-level synthesis, with its development-cost advantage over the traditional RTL design
paradigm, is gaining traction in the circuit design community. It is, however, in its early
phase, and these tools still pose challenges to early adopters.
HLS tools may have limited support for HLL constructs. For instance, VHLS requires
pointer arrays to reference values or arrays of values; platform-specific functions such
as memcpy() and memset() are supported but const values must be used; and finally,
while tail-recursive functions written with C++ template constructs can be transformed
into loops at compile time, recursion in general cannot be implemented [Xil12]. Additionally,
software programs often rely on libraries, many of which are platform-specific,
whereas in HLS, these libraries may not be appropriate and are likely to be unavailable.
These limitations make migrating existing software source code to a functional
HLS design a demanding task.
Optimizing C code for HLS can be a laborious process that requires expertise in hardware
design. Although software compilers face similar challenges to make programs run faster,
experienced programmers can often manually fine-tune software implementations for
performance. Because of the flexibility of HLLs, it is currently difficult for designers
to apply common intuitions, and the quality of the synthesis result may be difficult
to predict [Gup+04]. Winterstein et al. [Win+13] implemented K-means clustering
algorithms in HLS, and discovered that with extensive manual code transformations and
#pragma statements that are specific to VHLS, the tool can be persuaded to produce an
efficient circuit. When compared to the RTL counterparts, their HLS designs achieved
up to 40% of the performance in terms of area-time product. Zhang et al. [Zha+15]
implemented a convolutional neural network (CNN) accelerator in VHLS. They optimized
their design with program transformation techniques such as loop interchange, tiling,
pipelining and unrolling, and noticed that by enumerating combinations of loop tile sizes,
loop nest ordering and unroll factors, they were able to select the best implementation by
analytically estimating the throughput of each. Similarly, Suda et al. [Sud+16] explored
the design space of their CNN accelerator by solving a resource-constrained throughput
optimization problem, in order to generate a high-performance CNN accelerator to be
synthesized by the Altera OpenCL compiler [Alt15a]. HLS tools can provide some syntactic
constructs to automate lower-level code transformations such as instruction parallelism,
loop pipelining and unrolling. It is, however, up to the engineer to decide how to
utilize these transformations, and to determine whether they will improve the design.
Higher-level optimizations, such as the manual design space exploration explained in
the earlier examples, may allow tools to vastly improve the performance of HLS applications.
Automating these techniques in HLS, however, remains a significant challenge.
HLS tools can often make use of techniques such as loop pipelining, discussed in Section 2.2.3,
to detect and exploit parallelism at the instruction level. Coarser-grain parallelism,
which is tightly associated with the algorithmic details, is however much more complex
[Nan+16], and many optimization opportunities are not yet exploited by HLS tools.
For instance, unlike the CPU, which has a monolithic storage and whose performance
is thus often limited by the memory wall, the FPGA has dedicated RAM blocks in the logic
fabric to distribute memory bandwidth via data reuse. HLLs such as C were designed
with the von Neumann architecture in mind, and source code in C typically does
not specify how the memory hardware is utilized. For this reason, HLS tools must be able
to intelligently partition a monolithic memory into smaller chunks that can be accessed
in parallel, in order to maximize performance; how to attain this is still a challenging
research area [Con+11a; Con+12; Wan+13; Win+15].
Circuits produced by HLS tools are expected to be semantics-preserving. This means
that they should be functionally equivalent to the original C programs; in other words,
for any given program inputs, the tools should guarantee that the outputs computed
by the original C code and the synthesized circuit are identical. A different
class of optimizations, which we call lossy optimizations, breaks this promise by optimizing
data-paths in a way that may impact numerical accuracy. HLS tools could benefit from
these optimizations in the future, which bring further performance improvements that
cannot be attained by traditional optimizations alone. Because these approaches could
affect numerical accuracy, performance optimization and round-off error analysis may be
carried out simultaneously. We will further discuss lossy optimization methods such as
expression balancing, used by VHLS and LegUp, in Section 2.5.
2.3 Program Analysis and Abstract Interpretation
As our way of living is becoming increasingly dependent on programs, errors in
safety-critical systems can incur huge expenses, and even cost lives. For example, the maiden
flight of Ariane 5 resulted in a failure, because a software instruction failed to convert
a 64-bit floating-point number into a 16-bit signed integer, as the result was too large to be
represented [Dow97]. The Patriot defense system failed to intercept an incoming missile
because of an accumulated round-off error in the system’s internal clock, which resulted
in the deaths of 28 people in 1991 [Off92]. Static analysis, the process of analyzing a
program written in an HLL without executing it, is therefore a research topic of great
importance to prevent similar catastrophic errors and mitigate the cost of failure in the
future.
It is unfortunate that because of the halting problem [Tur37] and a direct consequence
of it, Rice’s theorem [Ric53], any nontrivial property on the outcome of a program is in
general undecidable. This means that an interesting property, a yes or no question which
is neither always true nor always false for all programs, cannot in general
be answered algorithmically. Even a question as seemingly innocuous as
“does this program return zero” falls into this category. A static analyzer, when faced with
such a question, therefore does not attempt to produce a definite yes or no, instead it
answers with either a definite yes or I don’t know [Min04]; and producing a meaningful yes
in an efficient manner poses a challenging task to static analyzers. Additionally, they often
rely heavily on formal techniques to perform well. Typical techniques employed include
symbolic execution, model checking [Kro+03], satisfiability modulo theories [DMB08],
data-flow analysis based on lattices [Nie+99], abstract interpretation [CC77], etc.
There are static analyzers specifically tailored to prove the absence of run-time errors
in computing systems. Astrée [Ast], based on the theory of abstract interpretation, has
been successful in proving the safety of the flight control systems of the Airbus A340 and
A380 series, and the automatic docking software of the Jules Verne Automated Transfer
Vehicle [Bou+09]. Other static analysis tools which employ abstract interpretation
include MathWorks’s Polyspace Bug Finder [PBF], Fluctuat [Flu], and ECLAIR [ECL].
This section starts by introducing the data-flow analysis framework to analyze a simple
program; abstract interpretation is then applied to this example, and the properties of the
resulting analysis are further discussed.
2.3.1 Data-Flow Analysis Framework
In this section, we use the data-flow analysis (DFA) framework [Nie+99] to deduce the
semantics of a program named simple in Figure 2.7, which consists of only one variable
x. We assume an initial set ι ⊆ R of values of x, and the property of interest is
whether a particular value xinvalid is unreachable. Computations are performed in real
arithmetic for simplicity. By computing X, the set of all reachable final values of
x, it suffices to check xinvalid ∉ X. A sensible definition for the set of values that can
be reached by x is a subset of all real numbers R, i.e. an element of ℘ (R), where ℘ (R)
denotes the power set of R, also known as the set of all subsets of R.
real simple(real x) {
    while (x > 1)
        x *= 0.9;
    return x;
}
Figure 2.7. A simple program example to be statically analyzed.
The first step of DFA is to translate the body of simple into a control/data-flow graph
(CDFG), as shown in Figure 2.8, where each block consists of a single statement or
conditional, and the edges in the graph model the data- and control-flows. The tt and
ff labels respectively highlight the control-flow branches taken when the conditional “x > 1”
evaluates to either true or false.
Figure 2.8. The CDFG of simple in Figure 2.7.
The individual blocks in the CDFG can therefore be defined as functions f : ℘ (R) → ℘ (R),
whose inputs and outputs are both elements of ℘ (R). For instance, for the statement
“x *= 0.9;” a function f1 can be defined as follows:
f1(S) = {0.9v | v ∈ S}. (2.9)
Here, the definition of f1 indicates that each possible input value v of x in the set S
is multiplied by 0.9, and the multiplied results are collected into a new set as the output of f1.
Similarly, because “x > 1” has two conditional branches, two functions, f2,tt and f2,ff,
respectively for its true- and false-branches, can be defined:
f2,tt(S) = S ∩ {v ∈ R | v > 1},
f2,ff(S) = S ∩ {v ∈ R | v ≤ 1}, (2.10)
where X ∩ Y computes the intersection of the two sets X and Y.
In the next step, the edges of the CDFG are labelled with numbers 0, 1, 2 and 3 to signify
different locations of the program. For each edge labelled i, it is now possible to compute
A(i), the set of values that could be reached by x in a program execution at location
i, by wiring up the functions f1, f2,tt and f2,ff that correspond to program statements.
This gives rise to the following system of data-flow equations:
A(0) = ι, (2.11)
A(1) = f2,tt(A(0) ∪ A(2)), (2.12)
A(2) = f1(A(1)), (2.13)
A(3) = f2,ff(A(0) ∪ A(2)), (2.14)
where A(0) ∪ A(2) is the union of A(0) and A(2).
Unfortunately, computationally solving this system of equations is not an easy task. In
the rest of this section, the two significant impediments are explained, and subsequently,
theories are introduced to address them. Section 2.3.2 discusses how the system of
data-flow equations can be solved analytically for the most desired solution, using a
minimal set of reachable values for A(1) as an example. However, this analytical solution
cannot be computed by a machine. By building on top of this foundation, Section 2.3.3
therefore provides a practical solution to compute a safe approximation of this set by an
algorithm.
2.3.2 Least Fixpoint Solution to a Data-Flow Analysis Problem
There are multiple solutions to this system. For example, we can solve it manually by
substituting A(0) and A(2) in (2.12) with (2.11) and (2.13). We arrive at:
A(1) = (ι ∪ {0.9v | v ∈ A(1)}) ∩ {v ∈ R | v > 1}. (2.15)
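It turns out that the set of all real numbers greater than 1:
A(1) = {v ∈ R | v > 1}, (2.16)
is a solution to (2.15). Substituting A(1) in the right-hand side of (2.15) with this value
proves that it is indeed a solution to this equation, assuming all sets below are subsets
of R:
$$\begin{aligned}
& \left( \iota \cup \{ 0.9v \mid v \in \{ v' \mid v' > 1 \} \} \right) \cap \{ v \mid v > 1 \} \\
={} & \left( \iota \cup \{ 0.9v \mid v > 1 \} \right) \cap \{ v \mid v > 1 \}
 = \left( \iota \cup \{ v \mid v > 0.9 \} \right) \cap \{ v \mid v > 1 \} \\
={} & \left( \iota \cap \{ v \mid v > 1 \} \right) \cup \left( \{ v \mid v > 0.9 \} \cap \{ v \mid v > 1 \} \right) \\
={} & \left( \iota \cap \{ v \mid v > 1 \} \right) \cup \{ v \mid v > 1 \} = \{ v \mid v > 1 \}.
\end{aligned} \tag{2.17}$$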
Intuitively, a manual inspection of simple finds that x can reach values v, 0.9v, 0.9^2 v,
and so on, such that all values in this sequence are greater than 1, for each v ∈ ι; or more
succinctly, an alternative solution to A(1) should be:
A(1) = {v′ | v′ > 1 ∧ v′ = 0.9^k v ∧ v ∈ ι ∧ k ∈ N}, (2.18)
where k ∈ N denotes k is one of 0, 1, 2, . . ., i.e. a natural number.
It is evident that the latter solution (2.18) is more precise, hence more desirable, than
the former (2.16). Not only does it contain the information the former has, i.e. all values
in A(1) are greater than 1, it also expresses the fact that it only consists of
values of the form 0.9^k v that are greater than 1, where v ∈ ι and k ∈ N. A useful
definition of preciseness is therefore the subset relation “⊆”. If it is known that X ⊆ X′,
and X and X′ are both solutions to a system of data-flow equations, then X is clearly
more appealing than X′.
The set ℘ (R), with a preciseness ordering “⊆”, is a partially ordered set. It has the
following three properties for any X, Y, Z ∈ ℘ (R): it is reflexive, X ⊆ X; it has the antisymmetry
property, i.e. if X ⊆ Y and Y ⊆ X, then X = Y ; and finally it is transitive, if X ⊆ Y and
Y ⊆ Z, then X ⊆ Z. In contrast to a total order such as the set of reals R, not every
two elements in ℘ (R) can be compared, e.g. neither of the sets {1, 2, 3} and {2, 3, 4} is a
subset of the other.
For the purpose of computing the solution to A(1)’s equation (2.15), a function f :
℘ (R) → ℘ (R) can be defined:
f(X) = (ι ∪ {0.9v | v ∈ X}) ∩ {v ∈ R | v > 1}, (2.19)
so that all solutions of the original equation (2.15) are now in the following set, whose
elements are known as the fixpoints* of f:
Fix(f) = {X ∈ ℘ (R) | f(X) = X}. (2.20)
Under this definition of preciseness, it is clear that we wish to find the
fixpoint of f that is smallest with respect to “⊆”. With this in mind, two important
questions arise:
1. Is the most precise solution unique? A unique most precise solution is defined as
the only one which is the most precise among all possible solutions to the system of
data-flow equations. In other words, if it exists, then it is defined as the least fixpoint
(LFP) of f, which is a subset of all other fixpoints, i.e. lfp(f) ⊆ Y
for any Y ∈ Fix(f). As we have discussed earlier, multiple fixpoints exist, and it is
possible that these fixpoint solutions are not comparable.
2. If a most precise solution exists and it is unique, how do we find it? This is equivalent to
finding a way to compute the LFP lfp(f) using f.
As it turns out, to answer the first question, Tarski’s fixpoint theorem [Tar55; Nie+99]
can be used to prove that lfp(f) is indeed unique.
*Another common name for fixpoint is fixed point. To avoid being mistaken for the fixed point representation,
a binary number representation, the term fixpoint is used instead.
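For the second question, Kleene’s fixpoint theorem shows that in our example analysis,
the LFP of f can be computed as:
$$\mathrm{lfp}(f) = \bigcup_{k \in \mathbb{N}} f^k(\varnothing). \tag{2.21}$$
Here, ∅ is the empty set, and a function of the form h^n(x), where h : M → M for any
domain M and n ∈ N, is recursively defined as:
$$h^n(x) = \begin{cases} h(h^{n-1}(x)) & \text{if } n > 0, \\ x & \text{if } n = 0. \end{cases} \tag{2.22}$$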
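The functions f^k(∅) for the first k + 1 iterations can be evaluated as follows:
$$\begin{aligned}
f^0(\varnothing) &= \varnothing, \qquad f^1(\varnothing) = \iota \cap \{ v \mid v > 1 \}, \\
f^2(\varnothing) &= f(f^1(\varnothing)) = \left( \iota \cup \{ 0.9v \mid v \in \iota \} \right) \cap \{ v \mid v > 1 \}, \\
f^3(\varnothing) &= \left( \iota \cup \{ 0.9v \mid v \in \iota \} \cup \{ 0.9^2 v \mid v \in \iota \} \right) \cap \{ v \mid v > 1 \}, \ \ldots, \\
f^k(\varnothing) &= \left( \iota \cup \{ 0.9v \mid v \in \iota \} \cup \ldots \cup \{ 0.9^{k-1} v \mid v \in \iota \} \right) \cap \{ v \mid v > 1 \}.
\end{aligned} \tag{2.23}$$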
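Finally, the most precise solution to (2.15) can be computed using the LFP formula for
f:
$$\begin{aligned}
\mathrm{lfp}(f) = \bigcup_{k \in \mathbb{N}} f^k(\varnothing)
&= \{ v \mid v > 1 \} \cap \bigcup_{k \in \mathbb{N}} \{ 0.9^k v \mid v \in \iota \} \\
&= \{ v' \mid v' > 1 \wedge v' = 0.9^k v \wedge v \in \iota \wedge k \in \mathbb{N} \}.
\end{aligned} \tag{2.24}$$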
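For a finite initial set the iteration above can be carried out directly. The sketch below evaluates f from (2.19) on finite sets and repeats it from the empty set until the result stops changing. Exact rationals are used so that 0.9 is represented without rounding; the names `f` and `kleene_lfp`, the `max_iter` guard and the choice ι = {4} are assumptions made for illustration.

```python
from fractions import Fraction

SCALE = Fraction(9, 10)  # the constant 0.9, represented exactly


def f(X, iota):
    """f(X) = (iota ∪ {0.9 v | v ∈ X}) ∩ {v | v > 1}, on finite sets."""
    candidates = set(iota) | {SCALE * v for v in X}
    return frozenset(v for v in candidates if v > 1)


def kleene_lfp(iota, max_iter=1000):
    """Iterate f from the empty set; return (fixpoint, k) with f^k = f^(k+1)."""
    X = frozenset()
    for k in range(max_iter):
        nxt = f(X, iota)
        if nxt == X:
            return X, k
        X = nxt
    raise RuntimeError("no fixpoint reached within max_iter iterations")


lfp, steps = kleene_lfp({Fraction(4)})
print(steps, sorted(float(v) for v in lfp))
```

The iteration terminates here only because ι is finite and every chain 4 × 0.9^k eventually drops to 1 or below; for an infinite ι such as an interval of reals this enumeration is not possible.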
Even though we have derived a method to statically analyze a program, significant
obstacles stymie its efficient use. Firstly, in the case study of simple, because
the LFP is evaluated as the union of f^k(∅) in a sequence, this sequence is likely to be
infinite, and thus cannot be computed fully. Secondly, the set of input values, ι, not only
determines the number of iterations necessary in order to calculate the LFP, but also
impacts the amount of computation required in each iteration. For instance, if ι = {4}
then it is only necessary to track the computation for a single input value 4, whereas
when ι = {v | 0 ≤ v ≤ 1000}, there are infinitely many values in the set. As a result, in
general, the LFP of an arbitrary self-map function f : L → L is not computable in
a finite amount of time. In Section 2.3.3, a method known as abstract interpretation is
introduced to overcome the computability problem.
2.3.3 Abstract Interpretation with Intervals
A framework of methods, known as abstract interpretation (AI), was proposed by Cousot
et al. [CC77] to formally mitigate the problem of computability in program analysis.
Instead of finding the LFP, which may not be computable, it is much more efficient to
work out an approximation of the LFP. Although the outcome of an AI-based static analysis
is not as precise as the LFP, the significant benefits of AI are two-fold. Firstly, the
program analysis framework can now produce a “yes” or “I don’t know” answer to a
query of a program property in a finite amount of time. Secondly, it provides the means
to prove the correctness of an answer produced by the static analyzer using AI in formal
mathematics.
Here, a simple formulation of interval arithmetic is first introduced, which is then
exercised in the AI framework, again by using the simple example in Figure 2.7.
Interval Arithmetic
Interval arithmetic (IA) is a method to enclose a set of solutions that may arise in
computation problems [Moo+09]. The standard method is to use a pair of values [a, b],
to represent {v | a ≤ v ≤ b}, a potentially infinite set of real numbers between a and b.
As it is closely related to sets of reals and real arithmetic, IA gets the benefits of both
worlds.
Firstly, operations similar to the set operations on subsets of R, such as the subset relation “⊆”, union
“∪” and intersection “∩”, can also be defined for real intervals Interval. The
corresponding operations are known as the partial ordering “⊑”, join “⊔” and meet “⊓”; a
partially ordered set with these operations defined for all elements within it is a complete
lattice. The definitions of these operators on two intervals [a, b] and [c, d] are as follows:
$$\begin{aligned}
[a, b] \sqsubseteq [c, d] &:= a \ge c \wedge b \le d, \\
[a, b] \sqcup [c, d] &:= [\min(a, c), \max(b, d)], \\
[a, b] \sqcap [c, d] &:= \begin{cases} [\max(a, c), \min(b, d)] & \text{if } \max(a, c) \le \min(b, d), \\ \bot & \text{otherwise,} \end{cases}
\end{aligned} \tag{2.25}$$
where s := t indicates s is defined as t, min(x, y) and max(x, y) respectively compute
the minimum and maximum of x and y, and ⊥, ⊤ respectively denote the interval with no
elements, i.e. an empty interval, and the entire set of reals. For completeness, the above
relations can be further extended for ⊤ and ⊥, where X♯ ∈ Interval is an interval:
X♯ ⊔ ⊤ := ⊤, and X♯ ⊓ ⊥ := ⊥. (2.26)
As a consequence, ⊥ and ⊤ are respectively the least and greatest elements of Interval.
This means that ⊥ ⊑ X♯ and X♯ ⊑ ⊤ for any interval X♯.
Secondly, in a similar fashion to real arithmetic, arithmetic operations can also be defined
for real intervals:
[a, b] + [c, d] = [a+ c, b+ d] ,
[a, b]− [c, d] = [a− d, b− c] ,
[a, b]× [c, d] = [min(s),max(s)] ,
where s = {a× c, a× d, b× c, b× d} ,
− [a, b] = [−b,−a] .
(2.27)
In IA, a scalar value x represents the interval [x, x], which can be abbreviated by x itself.
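The lattice and arithmetic operations on intervals can be sketched in a few lines. In the snippet below the empty interval is encoded as `None` and the whole real line as an interval with infinite bounds; both encodings, and all function names, are assumptions made for illustration. Arithmetic on the empty interval is taken to return the empty interval, and multiplication assumes finite bounds, since a zero times an infinity is not a number in floating point.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float


BOT = None                               # the empty interval
TOP = Interval(-math.inf, math.inf)      # the entire set of reals


def leq(x, y):
    """Partial ordering: x is enclosed by y."""
    if x is BOT:
        return True
    if y is BOT:
        return False
    return x.lo >= y.lo and x.hi <= y.hi


def join(x, y):
    if x is BOT:
        return y
    if y is BOT:
        return x
    return Interval(min(x.lo, y.lo), max(x.hi, y.hi))


def meet(x, y):
    if x is BOT or y is BOT:
        return BOT
    lo, hi = max(x.lo, y.lo), min(x.hi, y.hi)
    return Interval(lo, hi) if lo <= hi else BOT


def add(x, y):
    if x is BOT or y is BOT:
        return BOT
    return Interval(x.lo + y.lo, x.hi + y.hi)


def sub(x, y):
    if x is BOT or y is BOT:
        return BOT
    return Interval(x.lo - y.hi, x.hi - y.lo)


def mul(x, y):
    if x is BOT or y is BOT:
        return BOT
    s = [x.lo * y.lo, x.lo * y.hi, x.hi * y.lo, x.hi * y.hi]
    return Interval(min(s), max(s))


print(mul(Interval(-1, 2), Interval(3, 4)))
print(meet(Interval(0, 1), Interval(3, 4)))
```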
An Informal Approach to Approximation
AI is a theoretical framework to approximate mathematical objects that are not directly
computable. We start by explaining what it means to approximate, then show how an
approximation can be proved to be safe.
In the DFA of simple in Section 2.3.1, we arrived at a function f : ℘ (R) → ℘ (R) defined
in (2.19), which is not directly computable. The IA introduced earlier could serve as
an inspiration to produce a different function f̃ : Interval → Interval, which is very
similar to f:
$$\tilde{f}(X^\sharp) = \left( \iota^\sharp \sqcup 0.9 X^\sharp \right) \sqcap [1, \infty], \tag{2.28}$$
where ι♯ is an interval that bounds the set of initial values ι. There are two noticeable
differences between f and f̃. Firstly, the domain in which f̃ carries out its computation
is Interval instead of ℘ (R). Secondly, because the domain used is different from that of f, the
function definition is updated accordingly for f̃.
When given an input X, f must enumerate all elements of X to compute the result f(X) precisely.
As we have discussed earlier, this is infeasible as X could contain an infinite number of
values. Conversely, f̃ does not suffer from this problem, since (2.27) dictates that any IA
operation can be performed by a finite number of real arithmetic computations.
As it is possible to prove that both fixpoint theorems in Section 2.3.2 hold for Interval





For example, consider the case when ι = [0, 10],
f̃^0(⊥) = ⊥,
f̃^1(⊥) = ([0, 10] ⊔ 0.9⊥) ⊓ [1, ∞] = [1, 10],
f̃^2(⊥) = ([0, 10] ⊔ (0.9 × [1, 10])) ⊓ [1, ∞] = [1, 10], . . .
(2.30)
As all other values in the sequence evaluate to [1, 10], lfp(f̃) is hence [1, 10], which was
computed in 3 iterations. It is easy to see in Figure 2.7 that the set of reachable values of x
before executing the statement “x *= 0.9;” is indeed [1, 10].
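The iteration in (2.30) can be replayed mechanically. The sketch below is only an illustration under stated assumptions: intervals are plain `(lo, hi)` tuples, ⊥ is modelled as `None`, and the helper names `join`, `meet`, `scale`, `f_tilde` and `lfp` are hypothetical.

```python
import math

BOTTOM = None  # models the empty interval, i.e. the least element


def join(a, b):
    """Least upper bound of two intervals."""
    if a is BOTTOM:
        return b
    if b is BOTTOM:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    """Greatest lower bound of two intervals."""
    if a is BOTTOM or b is BOTTOM:
        return BOTTOM
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else BOTTOM


def scale(c, a):
    """Multiply an interval by a non-negative scalar c."""
    return BOTTOM if a is BOTTOM else (c * a[0], c * a[1])


def f_tilde(x, iota=(0.0, 10.0)):
    # (iota join 0.9 * x) meet [1, inf], as used in (2.30)
    return meet(join(iota, scale(0.9, x)), (1.0, math.inf))


def lfp(f, max_iter=100):
    """Iterate f from BOTTOM until two consecutive values agree."""
    x = BOTTOM
    for k in range(max_iter):
        nxt = f(x)
        if nxt == x:
            return x, k
        x = nxt
    raise RuntimeError("no fixpoint reached within max_iter applications")


fixpoint, k = lfp(f_tilde)
print(fixpoint, k)  # (1.0, 10.0) 1
```

Here `k` is the exponent of the first iterate that equals its successor. For other choices of `iota` or loop body the plain iteration may take many steps or never stabilize, which is the motivation for the widening operators discussed next.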
In many cases, an algorithm to compute f̃^k(⊥) can terminate. It is clear that if f̃^k(⊥) =
f̃^(k+1)(⊥) for some k ∈ N, then for all integers j that are greater than k, f̃^j(⊥) = f̃^k(⊥) and
there is no point in computing f̃^j(⊥). Widening and narrowing operators [CC77; Nie+99]
can be used to reduce the number of iterations required in the iterative computation
steps, hence accelerating, or even ensuring, termination by sacrificing precision of the
computed fixpoint, i.e. the result is no longer the LFP, but is a fixpoint X ⊒ lfp(f̃) that
can be computed more easily.
Galois Connection
Although our informal derivation gives us empirical evidence of the usefulness and
correctness of intervals in place of real sets, a series of questions pertaining to the theory
behind it remain. The questions are of the form “is X a correct abstraction of Y ”,
where X and Y refer to each of the following pairs: (Interval, ℘ (R)), (f̃, f), and
(lfp(f̃), lfp(f)).
Fortunately, Cousot et al. [CC77] show that if a Galois connection can be formed between
℘ (R) and Interval, then all of the above questions can now be answered with a
resounding “yes”. We can define [Nie+99]:
Definition 2.1. [Galois connection] A Galois connection is a relation between two complete
lattices 〈L,⊆〉 and 〈M,⊑〉, given by a pair of functions α : L → M and γ : M → L, such that
for all l ∈ L and m ∈ M:
α(l) ⊑ m if and only if l ⊆ γ(m), (2.31)
alternatively the following property may sometimes be easier to work with:
l ⊆ γ(α(l)) and α(γ(m)) ⊑ m. (2.32)
The Galois connection above can often be concisely written as:
$\langle L, \subseteq \rangle \underset{\gamma}{\overset{\alpha}{\rightleftarrows}} \langle M, \sqsubseteq \rangle$. (2.33)
Here, the functions α and γ are often called the abstraction function and concretization
function respectively.
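Building on the informal approach earlier, the following Galois connection can be
established to formalize it:
$\langle \wp(\mathbb{R}), \subseteq \rangle \underset{\gamma}{\overset{\alpha}{\rightleftarrows}} \langle \mathrm{Interval}, \sqsubseteq \rangle$, (2.34)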
The functions α : ℘ (R)→ Interval and γ : Interval→ ℘ (R) that satisfy (2.31) can be
defined as follows:
α(X) = [inf(X), sup(X)] ,
γ ([a, b]) = {v | a ≤ v ≤ b} ,
(2.35)
where inf(X) and sup(X) are respectively the infimum and supremum of X. Here α
takes a potentially infinite set of reals and translates it into a pair of values representing
an interval bounding the reals, whereas γ performs the backwards translation from an
interval to a set of reals. For instance:
α ({1, 1.2, 3}) = [1, 3] ,
γ ([1, 3]) = {v | 1 ≤ v ≤ 3} .
(2.36)
It is clear that information could be lost when the abstraction function α is applied. As
shown in the example above, a set of three real values 1, 1.2 and 3 can be approximated
by an interval [1, 3], which represents the set of all real values ranging from 1 to 3.
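The example in (2.36) and the two properties (2.31) and (2.32) can be checked on small instances. This is a sketch only: `alpha` is restricted to non-empty finite sets (so that the infimum and supremum are simply `min` and `max`), and `gamma` returns a membership predicate because the concrete set is infinite. All function names are illustrative.

```python
def alpha(values):
    """Abstract a non-empty finite set of reals to its bounding interval."""
    return (min(values), max(values))


def gamma(interval):
    """Concretize an interval as a membership test over the reals."""
    lo, hi = interval
    return lambda v: lo <= v <= hi


def leq(a, b):
    """Interval ordering: a is below b when a is enclosed by b."""
    return b[0] <= a[0] and a[1] <= b[1]


X = {1, 1.2, 3}
print(alpha(X))  # (1, 3)

# l is a subset of gamma(alpha(l)): every concrete value survives abstraction.
in_abstraction = gamma(alpha(X))
assert all(in_abstraction(v) for v in X)

# Information is lost: 2 is in gamma(alpha(X)) although it is not in X.
assert in_abstraction(2) and 2 not in X

# (2.31) on two sample intervals: alpha(X) below m iff X is a subset of gamma(m).
for m in [(0, 5), (1.1, 4)]:
    assert leq(alpha(X), m) == all(gamma(m)(v) for v in X)
```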
Furthermore, from a function g : L → L, the AI framework allows an approximated
function g♯ : M → M to be inductively abstracted by the Galois connection [Nie+99]. For
example, consider (2.19), which is not computable; an approximated candidate f♯ can be
computed by α ◦ f ◦ γ. Since we are computing in Interval, we assume that ι = γ(ι♯)
























{0.9v | X̲♯ ≤ v ≤ X̄♯})
(ι ∩ {v | v > 1}) ∪ ({v | 0.9X̲♯ ≤ v ≤ 0.9X̄♯}




We introduce ε to denote an infinitesimal positive value. Because α(A ∪ B) = α(A) ⊔ α(B):
f♯(X♯) = α({v | ι̲♯ ≤ v ≤ ῑ♯}









The term ε vanishes when the abstraction function α is applied:
f♯(X♯) = α({




















which is identical to f̃ defined in (2.28), therefore the correctness of f̃ is analytically
proven. Although we derived a special case using the approximate function induction
technique, in a more general fashion, this method can be applied to the functions and
operators in the DFA of reachable sets of reals. Consequently, the DFA of a general
program based on intervals can be induced.
Further Generalization
For simplicity, simple is used as an example of DFA. However, unlike simple, general
programs can consist of more than one variable. A DFA for a program should therefore
compute a reachable set of values for each variable. A typical way to do so is to use a
mapping which associates each program variable with a value, which is defined as an
element σ ∈ Σ, where Σ = [Var → L] and L is a set of values. For example, a single
program state of simple could be σ0 ∈ [Var → R], and σ0 = [x ↦ 0], which denotes
that the state σ0 has only a variable x, and x is assigned a value 0. We could also have
a state that captures multiple program states, e.g. σ♯ ∈ [Var → Interval], where σ♯(x)
could be an interval.
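Galois connections can be compositionally constructed [Nie+99]. If we know that
$\wp(\mathbb{R}) \underset{\gamma}{\overset{\alpha}{\rightleftarrows}} \mathrm{Interval}$, then the following Galois connection between mappings can also
be constructed:
$[\mathrm{Var} \to \wp(\mathbb{R})] \underset{\gamma'}{\overset{\alpha'}{\rightleftarrows}} [\mathrm{Var} \to \mathrm{Interval}]$,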
where α′(f) = α ◦ f , γ′(g) = γ ◦ g, and a term of the form s ◦ t, where s : B → C
and t : A → B denotes a function which accepts an input from A and produces an
intermediate output in B using t, then in turn s is used on the intermediate value to
compute a final output in C.
2.3.4 Abstract Domains
The interval domain offers us an efficient way to enclose reachable values of variables.
However, it is unable to capture the correlation among these variables. For instance, if
we know that x and y are reals between 0 and 1, and x ≤ y, intervals cannot express
the relation x ≤ y. Using only the bounds x ∈ [0, 1] and y ∈ [0, 1], we find that the expression
(1 − y)x is bounded by 0 and 1. In contrast, if we were able to make use of the inequality
x ≤ y, then it is possible to deduce (1 − y)x ≤ (1 − x)x ≤ 1/4, which yields a much tighter
bound than using intervals alone.
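The gap between the two bounds can be observed numerically. The sketch below evaluates (1 − y)x once with IA on tuples, and once by sampling a grid restricted to x ≤ y; the grid resolution `n` and the helper names are arbitrary choices made for illustration, and sampling is not a proof of the bound.

```python
def sub(a, b):
    """Interval subtraction on (lo, hi) tuples, following (2.27)."""
    return (a[0] - b[1], a[1] - b[0])


def mul(a, b):
    """Interval multiplication on (lo, hi) tuples, following (2.27)."""
    s = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(s), max(s))


x = y = (0.0, 1.0)
interval_bound = mul(sub((1.0, 1.0), y), x)
print(interval_bound)  # (0.0, 1.0)

# Sample only the points with x <= y; the largest value found is at x = y = 0.5.
n = 200
sampled_max = max((1 - j / n) * (i / n)
                  for i in range(n + 1) for j in range(i, n + 1))
print(sampled_max)  # 0.25
```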
In general, the design of abstract domains is a trade-off between how good an
approximation it can obtain, and how efficiently it can be computed. For example, we could
imagine another abstract domain, Sign, which only captures the signedness of variables,
to also enclose a set of reachable values. Although it is even faster than intervals to
compute, it sacrifices the precision of the value bounds. Ranging from the fastest to the
most expensive, a hierarchy of abstract domains for reals and floating-point computations
has been proposed by various authors, as discussed below.
A new abstract domain, Octagon, is proposed by Miné [Min07a], to enclose values in a
system of difference-bound inequalities, which are of the form:
x− y ≤ c, or ± x ≤ c, (2.41)
where x and y are variables and c is a constant. This domain is very efficient as a
Floyd-Warshall algorithm [Flo62], which runs in O(n³), where n is the number of variables, can
be used to compute it [Min04]. Although not as efficient as intervals, it is much more
expressive. This is because intervals are equivalent to a set of inequalities
of the latter form, ±x ≤ c, but they cannot represent the former which captures the
correlations between variables, x− y ≤ c.
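To illustrate how a shortest-path closure tightens constraints of the form (2.41), the sketch below stores each bound in a matrix `m`, where `m[i][j]` bounds v_i − v_j, and index 0 is an artificial variable fixed at zero so that ±x ≤ c also fits the difference form. This encoding and the function name `close` are assumptions made for the example, not necessarily the representation used in [Min04].

```python
import math

INF = math.inf


def close(m):
    """Floyd-Warshall closure: tighten every bound on v_i - v_j via each v_k."""
    n = len(m)
    m = [row[:] for row in m]  # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # v_i - v_j <= (v_i - v_k) + (v_k - v_j)
                m[i][j] = min(m[i][j], m[i][k] + m[k][j])
    return m


# Variables: index 0 is the constant zero, index 1 is x, index 2 is y.
# Constraints: 0 <= x <= 10, x - y <= 0 and y - x <= 2.
m = [
    [0,   0,   INF],  # 0 - x <= 0
    [10,  0,   0],    # x - 0 <= 10, x - y <= 0
    [INF, 2,   0],    # y - x <= 2
]
closed = close(m)
print(closed[2][0], closed[0][2])  # 12 0
```

The closure derives y ≤ 12 and −y ≤ 0, i.e. y ∈ [0, 12], from the bounds on x and the two difference constraints. The three nested loops give the O(n³) running time mentioned above.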
Based on affine arithmetic (AA), Ghorbal et al. [Gho+09] propose an abstract domain,
Taylor1+. This domain uses an equation of the following form to capture the bound on a variable
x:
$\tilde{x} = \alpha^x_0 + \sum_{i=1}^{n} \alpha^x_i \varepsilon_i$, (2.42)
where α^x_i ∈ R for each i, and each ε_i, known as a noise symbol, has an unknown
value that lies in [−1, 1] and introduces a perturbation weighted by α^x_i on the constant α^x_0.
If two variables x and y are correlated, x̃ and ỹ can share the same ε_i to encode the
correlation. They found that for a 2nd order infinite impulse response (IIR) filter, Taylor1+
can give an exact bound on the filter output, whereas both Interval and Octagon
failed as their bound sizes increase exponentially in each iteration, although Taylor1+ is
slower than both methods.
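The effect of shared noise symbols can be seen with a toy affine form. The class below is a hypothetical sketch that supports only addition and subtraction (the linear operations, which need no new noise symbols); it is not the Taylor1+ domain itself.

```python
class Affine:
    """centre + sum(coeffs[i] * eps_i), where every eps_i lies in [-1, 1]."""

    def __init__(self, centre, coeffs=None):
        self.centre = centre
        self.coeffs = dict(coeffs or {})  # noise symbol index -> coefficient

    def _combine(self, other, sign):
        keys = set(self.coeffs) | set(other.coeffs)
        coeffs = {k: self.coeffs.get(k, 0.0) + sign * other.coeffs.get(k, 0.0)
                  for k in keys}
        return Affine(self.centre + sign * other.centre, coeffs)

    def __add__(self, other):
        return self._combine(other, 1)

    def __sub__(self, other):
        return self._combine(other, -1)

    def bounds(self):
        radius = sum(abs(c) for c in self.coeffs.values())
        return (self.centre - radius, self.centre + radius)


x = Affine(0.5, {1: 0.5})  # x in [0, 1], carried by noise symbol 1
y = Affine(0.5, {2: 0.5})  # y in [0, 1], carried by an independent symbol 2

print((x - x).bounds())  # (0.0, 0.0)
print((x - y).bounds())  # (-1.0, 1.0)
```

Because both operands of `x - x` carry the same noise symbol, the perturbations cancel and the result is exactly zero, whereas IA would report [0, 1] − [0, 1] = [−1, 1] by (2.27).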
Cousot et al. [CH78] introduce the polyhedra approach to abstract domains. This method
makes use of arbitrary polyhedra, which could be expressed by a set of linear inequalities
such as the following: 
2x + 7y ≤ 32,
8x− 3y + z ≥ 0.
(2.43)
Libraries that implement the polyhedra domain include APRON [Apr] and Parma Polyhedra
Library [PPL]. Although the polyhedra domain is very accurate for linear computations,
Ghorbal et al. [Gho+09] find both libraries to be much slower than their AA based
approach.
In addition to abstract domains on reals and floating-point values, integer-based domains
are further presented by several authors. Granger [Gra89] introduces simple congruences
on integers, which consist of equations of the form x = a mod b, where x is a
variable and a and b are integers. They further propose a linear congruences formulation
which is more expressive, for instance, an equation 3x + 4y = 5 mod 6 can be used to
capture the relation between integer variables x and y.
Round-off Error Analysis
In Section 2.5, we will discuss techniques which optimize numerical accuracy in a
program by restructuring it. These methods utilize a common analysis of floating-point
round-off errors, which is based on a formulation of AI. In this section, this numerical
accuracy analysis approach is thus explained in detail.
Because of the finite characteristic of the IEEE 754 floating-point format [ANS08], it is
not always possible to represent exact values with it. Computations in floating-point
arithmetic often induce round-off errors. Therefore, Martel [Mar07] bounds the ranges
of the values in floating-point calculations, as well as the round-off errors introduced in
these computations. This accuracy analysis determines the bounds of all possible outputs
and their associated range of round-off errors for expressions.
Martel [Mar07] introduces an abstract error semantics for the calculation of round-off
errors in the evaluation of floating-point expressions. It consists of three components:
an abstract domain to enclose computed floating-point values and round-off errors, a
suite of partial order operations on the abstract domain, and finally, a set of arithmetic
operations to evaluate arithmetic expressions using the abstract domain.
First we define the domain E♯ = IntervalF × Interval, where Interval and IntervalF
respectively represent the set of real intervals, and the set of floating-point intervals
(intervals that each enclose a range of floating-point values). The value (x♯, µ♯) ∈ E♯
represents a safe bound on floating-point values and the accumulated error represented
as a range of real values.
Secondly, for the Cartesian product of partial orders, a partial order relation inherently
arises by allowing element-wise partial order operations [AJ94]. In other words, the
following operations for the partial order relations can be induced for E♯, by using the

























































Furthermore, the error domain forms the following Galois connection:
$\wp(\mathbb{F} \times \mathbb{R}) \rightleftarrows \mathrm{Interval}_{\mathbb{F}} \times \mathrm{Interval}$, (2.45)
Finally, arithmetic operations can also be defined for the values in E♯. To start we
introduce the functions ↑♯◦ and ↓♯◦, which respectively compute the interval of rounded
floating-point results and the range of round-off error from arithmetic operations under a
given rounding mode. The function ↑♯◦ : Interval → IntervalF computes the
floating-point bound from a real bound, by rounding the infimum a and supremum b of the input
interval [a, b]:
↑♯◦ ([a, b]) := [↑◦ (a), ↑◦ (b)]F, (2.46)
where the subscript F indicates the interval is a floating-point interval, and ↑◦ (v) indicates
rounding a value v to a floating-point value. The function ↓♯◦ : Interval → Interval
determines the range of round-off error due to the floating-point computation:






, where z = max(ulp(a), ulp(b)). (2.47)
Here z denotes the maximum rounding error that can occur for values within the range
[a, b], and the unit of the last place (ulp) function ulp(x) characterizes the gap between
the two floating-point numbers closest to x. However, there are several variations
of the definition of ulp, and which one is the most fitting is still subject to debate [Mul05].
We can now define the arithmetic operations on values in E♯. For addition, subtraction



































































































The addition, subtraction and multiplication of intervals follow the standard rules of IA
defined earlier in (2.27).
Expressions can be evaluated for their accuracy by the method as follows. Initially, for
real-valued variables, the following function can be used to convert an interval of real
















For example, for the real variable a ∈ [0.2, 0.3] under single precision with rounding to
nearest:
cast ([0.2, 0.3]) = ([0.200000003, 0.300000012]F, [−1/67108864, 1/67108864]) . (2.51)
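The numbers in (2.51) can be reproduced with a few lines of code. The sketch below makes two assumptions that should be read as such: `ulp32` uses one particular definition of ulp (the spacing of single-precision values in the binade of its argument), and the error range is taken as plus or minus half of the larger ulp, which is the choice that matches (2.51) for rounding to nearest. The function names are hypothetical.

```python
import math
import struct


def to_float32(v):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', v))[0]


def ulp32(v):
    """Spacing of single-precision values around a normal, non-zero v."""
    _, e = math.frexp(v)  # v = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(1.0, e - 24)


def cast(lo, hi):
    """Return (floating-point interval, round-off error interval)."""
    z = max(ulp32(lo), ulp32(hi))
    return (to_float32(lo), to_float32(hi)), (-z / 2, z / 2)


values, error = cast(0.2, 0.3)
print(f"[{values[0]:.9f}, {values[1]:.9f}]")  # [0.200000003, 0.300000012]
print(error[1] == 1 / 67108864)               # True
```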
After this, similar to the way we use intervals to analyze a program in Section 2.3.3,
floating-point arithmetic expressions can be analyzed. For instance, the expression
(a + b)² can be evaluated just as we would expect in real arithmetic, except that the
arithmetic operators are overridden to operate on values in the error domain E♯.
For example, assume that real variables a ∈ [0.2, 0.3], b ∈ [2.3, 2.4], it is possible to
derive that in single-precision floating-point computation with rounding to the nearest,
(a + b)² ∈ [6.24999857, 7.29000187] and the error caused by this computation is bounded
by [−1.60634534 × 10⁻⁶, 1.60634534 × 10⁻⁶].
2.4 Intermediate Representations
IRs are data structures designed to be independent of the machine architecture and
source language. They are often designed to ease program analysis
and optimization, by abstracting away information in the original program that is
irrelevant to our objectives. In this section, we introduce several categories of existing
IRs, and delve deeper into the advantages and disadvantages of each.
2.4.1 Static Single Assignment Form and Control-Flow Graph
Traditionally, static single assignment form [Alp+88; Rau92] together with the control-
flow graph are used to represent data- and control-flow of a program [Cyt+91], because
they are more favorable program representations on which optimization passes can
be implemented, when compared to the original HLL or the output language. SSA
can be advantageous in implementing conventional optimization techniques, e.g. code
motion [Cyt+86], removing redundant computations [Ros+88], and constant propaga-
tion [Cyt+91]. Because the LLIR [LLIR] is based on SSA and CFGs, and is commonly
used in many HLS tools such as LegUp [LU], we introduce SSA and CFGs by compiling
the dot-product example in Figure 2.4 into LLIR as shown in Figure 2.9.
The LLIR of our example function consists of parts that are known as basic blocks (BBs).
Each BB in turn often contains a label that uniquely identifies the BB, a list of LLIR
statements in SSA form without any branches, i.e. the statements are executed sequen-
tially, and a terminator instruction, which is typically a branch instruction that leads the
control-flow to a different BB, by referencing a BB label or a function return.
define float @dotprod(
float* nocapture readonly %A,




; <label>:1 ; preds = %2
ret float %8
; <label>:2 ; preds = %2, %0
%i.02 = phi i32 [ 0, %0 ], [ %9, %2 ]
%d.01 = phi float [ 0.000000e+00, %0 ], [ %8, %2 ]
%3 = getelementptr inbounds float, float* %A, i32 %i.02
%4 = load float, float* %3, align 4, !tbaa !2
%5 = getelementptr inbounds float, float* %B, i32 %i.02
%6 = load float, float* %5, align 4, !tbaa !2
%7 = fmul float %4, %6
%8 = fadd float %d.01, %7
%9 = add nuw nsw i32 %i.02, 1
%exitcond = icmp eq i32 %9, 1024
br i1 %exitcond, label %1, label %2
}
Figure 2.9. The compiled and optimized LLIR output from the dot-product example in
Figure 2.4.
The LLVM framework implicitly constructs a CFG from the IR code, which is a directed
graph representing the control-flow of a program. The vertices in the CFG constitute BBs,
while the edges indicate the control-flow directions (i.e. branches to other BBs), often
with predicate attributes to determine whether the branch is taken. For instance, we
consider the first line of the third BB in Figure 2.9:
; <label>:2 ; preds = %2, %0
which indicates that it has a label value 2 and the control-flow coming to this BB is from either
BB2 or BB0. Here we use BBn as a shorthand denoting a BB labelled n. Additionally, this
BB ends with the branch terminator instruction:
br i1 %exitcond, label %1, label %2
This instruction directs the control-flow to BB1 or BB2, and the variable %exitcond in
the terminator instruction decides which branch is taken. Finally, the complete CFG is
shown in Figure 2.10. It is noteworthy that BB2 has two edges that lead to either BB1
or BB2 itself. If %exitcond evaluates to false (ff), then another iteration of BB2 will









Figure 2.10. The CFG of the LLIR code in Figure 2.9.
Each BB contains sequential computations that are represented by SSA instructions. The
SSA form describes the operations in the original program, such that each variable in it is
assigned exactly one value.
The sequence of instructions that assigns to %3-%9 in Figure 2.9 carries out most of the
computations in the program. It starts by reading A[i] and B[i], multiplying them
together, then adding the result to d to form a new variable, and finally, the iteration
value is incremented by 1. It may seem unusual that the accumulated sum of products
and the iteration value are not assigned to d and i respectively. We can imagine two
BBs: one initializes d and i to zeros, the other updates these two variables in a loop.
As all variables in an SSA form must be assigned once only, one of the BBs should use
different names for these two variables. When the control-flows of these two BBs join,
we must introduce a way to read from the variables that are assigned in the two BBs in
the succeeding BB. A new instruction, called the φ-function is therefore defined for our
purpose. The φ-function accepts two variable names as its inputs, and produces the value
of either variable as its output, determined by which preceding BB the control-flow came
from. For example, in LLIR, the instruction:
%d.01 = phi float [ 0.000000e+00, %0 ], [ %8, %2 ]
shows that if the control-flow originated from BB0, then a constant zero is returned,
otherwise the control-flow had to come from BB2 and the value of %8 is used instead.
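The role of the φ-function can be made concrete by interpreting the loop block by hand. The following sketch mirrors BB2 of Figure 2.9, with Python variables `v7`, `v8` and `v9` standing for %7, %8 and %9. The trip count is passed as a hypothetical parameter `n` instead of the constant 1024, `n` is assumed to be at least 1, and Python floats are double precision rather than `float`.

```python
def dotprod(A, B, n):
    """Interpret BB2, resolving each phi by the predecessor block."""
    pred = 0        # control-flow first enters BB2 from BB0
    v8 = v9 = None  # %8 and %9 are undefined until BB2 has run once
    while True:
        # %i.02 = phi i32 [ 0, %0 ], [ %9, %2 ]
        i = {0: 0, 2: v9}[pred]
        # %d.01 = phi float [ 0.0, %0 ], [ %8, %2 ]
        d = {0: 0.0, 2: v8}[pred]
        v7 = A[i] * B[i]  # %7 = fmul float %4, %6
        v8 = d + v7       # %8 = fadd float %d.01, %7
        v9 = i + 1        # %9 = add nuw nsw i32 %i.02, 1
        if v9 == n:       # br i1 %exitcond, label %1, label %2
            return v8     # BB1: ret float %8
        pred = 2          # otherwise BB2 is re-entered from itself


print(dotprod([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 3))  # 32.0
```

Every name is assigned at exactly one place in the loop body, and the only point at which the two incoming definitions are merged is the pair of dictionary lookups that stand in for the φ-functions.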
The rationale of SSA is that we can abstract away anti- and output dependences by never
assigning to the same variable twice, while only true data-flow dependences remain.
An anti-dependence is a dependence relation when a read operation must precede a
write to the same variable, and an output dependence is when two writes refer to the
same location. By removing these dependences and deferring the analysis of them,
certain program optimization analyses can run much more efficiently. Analyses that
may benefit from this include scheduling [Rau94], liveness analysis (estimating the
lifetime of variables to reduce register requirements) [Cyt+91], detecting opportunities
for parallelism [CF87], and finding equivalent parts in the program [Alp+88].
In a cyclic CFG, the control-flow could potentially revisit a BB, and instructions in this
BB will inevitably assign a different value to the same variable. This forms anti- and
output dependences, which could have a detrimental effect on efficient loop pipelining
in some computing machines. An alternative IR, the dynamic single assignment (DSA)
form [Rau92], can therefore be used in place of SSA to address this issue. DSA
defines a linear sequence of virtual registers for each variable, such that every time the
variable is assigned in a dynamic execution path, a new virtual register is used.
Alternatives to SSA and CFG
There are a number of alternative IRs that are similar in construction to the SSA and
CFG approach in LLIR. For instance, the data-dependence graph [Rau94] introduced
in Section 2.2.3 is designed for the purpose of capturing data-flow dependences in
polyhedral methods. Data-flow graph (DFG) is a popular alternative to SSA, which is
often a directed acyclic graph (DAG). In general, a DFG’s vertices are input, output and
operation nodes, and the edges capture the dependences between these nodes. A DFG,
however, generally does not preserve enough information for us to reconstruct a program
from the graph itself. A group of data structures, known as CDFG [OG86], is commonly
used to represent programs in HLS tools, e.g. SPARK [Gup+04]. A CDFG resembles a
CFG such as the one used in LLIR, but in lieu of using sequential instructions in SSA form
in graph vertices, each vertex contains a DFG, where no SSA temporary variables are
used and data-flow dependences can be explicitly identified by edges.
2.4.2 Equality Saturation
The IRs we discussed above are all used to analyze and transform the underlying program
structure, so as to produce a new representation of the optimized program. In a con-
ventional optimizing compiler, program optimization is often carried out in a sequence
of transformation passes, where each pass accepts a program, often written in a certain
IR, and produces an optimized program in the IR. The traditional practice is to always
apply these optimization phases in a fixed order, but a good ordering of these phases
is crucial to achieve a good optimized result, and the optimal ordering varies across
applications being compiled [Alm+04]. The process of finding the optimal ordering is
known as the phase-ordering problem, which is in general undecidable [TB06]. Moreover,
programs running on CPUs or GPUs are usually quantified by their throughput or latency.
In contrast, designs on FPGAs concern us with additional objectives besides run time, such
as power consumption and resource utilization, which impact the quality of the synthesized
circuit. Multiple designs which trade off these objectives could exist, and which design
to choose depends on the specifics of the use case. It is therefore sensible to explore the
desirable to have an IR and the associated optimization procedures to efficiently discover
equivalent structures that lead to different implementations of the original program.
In software, a novel approach called equality saturation is proposed in [Tat+] to find
multiple possible optimized variants of the original program, and subsequently deal with
the phase-ordering problem. It creates a new graph-based IR, program expression graph
(PEG), to encode the effects of executing the program.
To begin, we review the structure of the PEG, by considering a simple loop example in
Figure 2.11. By understanding how the PEG can be evaluated for the output values,
we can interpret how the PEG captures the control- and data-flow information of the
program. PEGs, similar to arithmetic expressions expressed in a tree structure, can be
evaluated in a bottom-up fashion, by recursively propagating computed values from the
leaf nodes to the root of the tree. However, unlike arithmetic expressions which are
acyclic, edges in PEGs may form cycles to express loops in the original program.
int x = 1;
int y = 1;
while (y <= 10) {
x = y * x;















Figure 2.11. A simple loop which computes the factorial of 10, and the resulting PEG. This
example and its PEG, showing computations that lead to the final x and y, is taken
from [Tat+].
Data-Flow Nodes
All loops in the PEG are formed by θ nodes, which are used in the following form:
θ
ι func( )
where it accepts two child graphs, ι and func, and func further takes the θ node as one of
its inputs to form a complete cycle. Evaluating a θ node produces a list of values computed
iteratively by the node’s subgraph. The first value in the list is the computed result of ι,
which we name i, and the values in the rest of the sequence are iteratively computed by func.
In functional programming, this is similar to iteratively computing the fixpoint fix F of
an initial empty list [], which is defined as:
fix F([]) = lim_{n→∞} F^n([]), where F(v) = prepend(i, map(func, v)). (2.52)
Here, map(func, v) applies the subgraph computation func to all elements in the list v,
and prepend(i, v′) prepends the element i to the list v′.
For example, the following subgraph extracted from Figure 2.11b evaluates to the




It is noteworthy that θ nodes may have subscripts. For instance, in Figure 2.11b, both
θ₁ nodes share the same subscript 1. This is used to indicate that the two sequences
produced by both θ nodes iterate simultaneously, i.e. they share the same iteration count
so that a new value for x can be computed as we update y. Therefore, the θ₁ node on the
left of this figure produces a sequence of the factorials of [1, 2, 3, . . .], i.e. [1, 2, 6, 24, . . .].
Computation nodes, such as the arithmetic operator + and the Boolean operators ≤ and ¬ in
Figure 2.11b, operate on a list of values, by performing the computation on each value in the list.
For instance, the ≤ node accepts two inputs, the sequence [1, 2, 3, . . .] derived earlier, and
a scalar 10, computes the result of x ≤ 10 for each element x in the sequence, and finally
produces the list, where tt and ff respectively denote true and false Boolean values:
[tt, tt, tt, tt, tt, tt, tt, tt, tt, tt,ff ,ff ,ff , . . .]. (2.53)
The subsequent ¬ node then negates all elements in the list:
[ff ,ff ,ff ,ff ,ff ,ff ,ff ,ff ,ff ,ff , tt, tt, tt, . . .]. (2.54)
Control-Flow Nodes
The θ node encodes an infinite sequence of computed values, whereas the output value
of the program is a scalar. By further representing control-flow information in a PEG, it
becomes possible to refer to a single value in this sequence, as the output of the program.
To do so, Tate et al. [Tat+] further introduce pass and eval nodes. The pass node finds
the first true (tt) value in a sequence of Boolean values, and returns the index of this
value. The eval node takes two child nodes, where the first node evaluates to a list l of values,
and the second is a scalar n used to select a scalar value from l, as the output of this
node.
To illustrate, the pass₁ node finds the index of the first tt in the list (2.54), which is 11. The eval
node of the output variable y subsequently fetches the 11-th item from the list [1, 2, 3, 4, . . .] we
produced earlier, which is 11. Similarly we can apply the same process to find that the
output x is 10!, i.e. the factorial of 10.
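The evaluation just described can be mimicked with lazily evaluated sequences. The sketch below is an illustration under assumptions, not the representation used in [Tat+]: each θ node becomes a memoized function from an iteration index to a value, indices are zero-based (so the 11-th item has index 10), and the loop body is assumed to increment y by one, consistent with the sequence [1, 2, 3, . . .] above. The names `theta`, `pass_` and `eval_` are hypothetical.

```python
from functools import lru_cache


def theta(init, step):
    """A theta node: value 0 is init, value n is step evaluated at n - 1."""
    @lru_cache(maxsize=None)
    def seq(n):
        return init if n == 0 else step(n - 1)
    return seq


def pass_(cond):
    """Zero-based index of the first iteration at which cond holds."""
    n = 0
    while not cond(n):
        n += 1
    return n


def eval_(seq, n):
    """Select the value of a sequence at iteration n."""
    return seq(n)


# Both nodes share the iteration index, like the two theta_1 nodes.
y = theta(1, lambda n: 1 + y(n))     # theta_1(1, 1 + y)
x = theta(1, lambda n: y(n) * x(n))  # theta_1(1, y * x)

exit_index = pass_(lambda n: not (y(n) <= 10))  # pass_1(not (y <= 10))
print(exit_index)            # 10, i.e. the 11-th item
print(eval_(y, exit_index))  # 11
print(eval_(x, exit_index))  # 3628800
```

The infinite sequences are never materialized; only the prefix needed to reach the exit index is computed.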
By using pass and eval nodes to represent the control-flow in an algebraic fashion, and
mixing data- and control-flows together in the graphical representation, PEGs provide us
with greater opportunities to optimize data-flow across control-flow boundaries and vice
versa. Simple equivalence rules can be defined for these nodes algebraically. For instance,
arithmetic operators can be distributed over θ and eval, e.g. eval(j, k) + i ≡ eval(j + i, k).
Complex transformations can therefore be deductively constructed from these simple
equivalence rules.
Equivalence Finding
By applying transformation passes to the PEG, their approach detects incremental
modifications, and appends these changes to the original PEG. The new changes, represented
as extra structures in the PEG, are linked to their corresponding equivalent nodes by
equivalence edges. These edges indicate that a pair of subgraphs are equivalent, forming a
program expression graph with equivalent structures (E-PEG), which could capture multiple
PEGs in the same structure. The resulting E-PEG is similar to the one in Figure 2.12,
where dashed edges indicate equivalences. It is notable that each edge allows a binary
choice, therefore the number of PEGs that can be represented in an E-PEG could be
exponential in the number of equivalence edges.






















Figure 2.12. A simple E-PEG example, taken from [Tat+].
By repeatedly applying transformation rules to append all possible equivalent structures
to the E-PEG, this graph will eventually saturate, i.e. no more equivalent structures can be
added to the graph because all possible equivalences are now discovered. This process
and the resulting E-PEG are more space-time efficient than enumerating all possible PEGs
along the path, because the E-PEG encourages sharing common subgraphs, even across
equivalence edges. This saturated graph can always be produced regardless of the
order in which we apply the passes, hence avoiding the phase-ordering problem. Furthermore,
the E-PEG defers the decision of whether an optimization should be committed until we
have reached full saturation, allowing the global optima to be discovered. In contrast,
because each optimization pass in a conventional compiler is performed consecutively,
the compiler must make the decision to commit changes immediately after applying the
optimization, which consequently often results in local optima.
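The idea of saturation can be illustrated on plain expression trees. The sketch below applies commutativity and associativity of a single binary operator at every subterm until no new expression appears. Unlike an E-PEG, it enumerates each equivalent expression explicitly and shares nothing between them, so it shows the saturation condition rather than the space efficiency; the tuple encoding and function names are assumptions made for this example.

```python
def rewrites(e):
    """Yield every expression reachable from e by one rule application."""
    if not isinstance(e, tuple):
        return  # leaves (variable names) cannot be rewritten
    op, left, right = e
    yield (op, right, left)  # commutativity
    if isinstance(left, tuple) and left[0] == op:
        yield (op, left[1], (op, left[2], right))  # (a+b)+c -> a+(b+c)
    if isinstance(right, tuple) and right[0] == op:
        yield (op, (op, left, right[1]), right[2])  # a+(b+c) -> (a+b)+c
    for new_left in rewrites(left):    # rewrite inside the left child
        yield (op, new_left, right)
    for new_right in rewrites(right):  # rewrite inside the right child
        yield (op, left, new_right)


def saturate(e):
    """Apply the rules until the set of equivalent expressions stops growing."""
    seen = {e}
    frontier = [e]
    while frontier:
        next_frontier = []
        for current in frontier:
            for new in rewrites(current):
                if new not in seen:
                    seen.add(new)
                    next_frontier.append(new)
        frontier = next_frontier
    return seen


equivalents = saturate(('+', ('+', 'a', 'b'), 'c'))
print(len(equivalents))  # 12
```

The result is the same set whichever rule is tried first, which is the property that sidesteps the ordering question; choosing the best candidate is deferred until the set is complete.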
2.5 Discovering Equivalent Programs
In this section, we explain existing optimization methods to restructure numerical pro-
grams with arithmetic equivalences. Because general numerical programs—consisting of
program statements, conditional branches and loops—generalize arithmetic expressions,
we start by introducing optimization methods for expressions, followed by those for
general numerical programs.
2.5.1 Improving Performance by Rewriting Arithmetic Expressions
Software Compilers
It is common knowledge to software programmers that a typical optimizing compiler,
such as GNU Compiler Collection (GCC) [St16] or Clang [Cla16], has some traditional
static analysis-based optimization passes such as dead code elimination, loop strength
reduction and constant propagation. These optimization passes, however, restrict themselves
to producing implementations that do not impact numerical accuracy, i.e. they compute
the same output given identical inputs.
A less well-known fact about these compilers is that they also support a variety of
optimization passes that are not enabled by default. These options, when enabled, allow
the compiler to yield faster software implementations for programs with a large proportion
of floating-point arithmetic computations. These optimization passes rewrite arithmetic
expressions into more efficient alternative forms which are equivalent to the originals
in real arithmetic, but when executed on a machine they compute different results. The
reason for the differences is that arithmetic in machines has finite precision, so computed
results must be rounded to the nearest representable values. These discrepancies, when
accumulated, could potentially result in wildly inaccurate outputs.
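A two-line experiment is enough to see such a discrepancy. The following sketch uses Python floats, which are IEEE 754 double-precision values, and compares two bracketings that are equal in real arithmetic:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # round(round(a + b) + c)
right = a + (b + c)  # round(a + round(b + c))

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Each addition rounds its result, and the intermediate sums `a + b` and `b + c` round differently, so reassociation changes the final value.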
There are a number of compiler options in GCC [St16] that specifically perform the above
optimizations. For example, the following options exist to enable expression-rewriting
heuristics:
• -fassociative-math enables arithmetic expression rewriting by associativity.
One of the heuristics applied is to perform exponentiation by squaring, e.g. an
expression x*x*x*x, which requires 3 multiplications, can be optimized as (x*x)*(x*x),
reducing the number of multiplications to 2 by sharing the value of the
subexpression x*x;
• -freciprocal-math can be used to rewrite x/y into x * (1/y), if 1/y can be
shared among subexpressions; and
• -fno-signed-zeros ignores the signedness of floating-point zeros, i.e.
0.0f and -0.0f are treated as identical, so that 0.0f*x can be simplified to 0.0f without
concerning us with the signedness of the result.
An extra optimization option, which encompasses the above options, can be used to
enable them all together; it further highlights the unsafe nature of these transformations
in its name, i.e. -funsafe-math-optimizations. Besides GCC, Clang, which uses
the LLVM framework, provides a similar option, -enable-unsafe-fp-math to use
arithmetic equivalences to reduce the number of floating-point operations in a program,
by possibly sacrificing numerical accuracy.
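The sensitivity of rounding to such rewrites can be observed directly. The following minimal Python sketch, with double-precision values chosen purely for illustration, evaluates two parsings of the same sum that are equal in real arithmetic but round differently.

```python
# Two parsings of a + b + c that are equal in real arithmetic but
# differ in double-precision floating-point arithmetic.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounds after a + b, then after adding c
right = a + (b + c)   # rounds after b + c, then after adding a

print(left == right)      # False
print(abs(left - right))  # a discrepancy of one unit in the last place
```

A compiler that reassociates this sum is therefore free to return either value, which is why these options are grouped under an "unsafe" flag.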
High-Level Synthesis Tools
In addition to the unsafe arithmetic expression rewriting heuristics inherited from the
LLVM framework, VHLS [Xil12] and LegUp [LU], which are both LLVM-based, can
make use of hardware-specific expression rewriting optimization passes to allow greater
parallelism in the synthesized circuit, thus improving throughput. By way of illustration,
Xilinx’s VHLS has a similar feature called expression balancing [Xil12], which aims to
balance an arithmetic expression tree using associativity. A technique known as tree
height reduction [NP91] further incorporates distributivity and control-flow rewriting.
However, neither of these methods produces optimal loop pipelining, as they do not
examine the implications of loop-carried dependences. For example, a loop body:
sum = ((sum + A[i]) + B[i]) + C[i];
when synthesized in VHLS with expression balancing, will produce a schedule which
corresponds to the following:
sum = (sum + A[i]) + (B[i] + C[i]);
74 Chapter 2 Background
This loop is more efficient than the original, because it has a delay of 2 adders between
consecutive iterations, instead of the original 3-adder delay, in the inter-iteration depen-
dences of sum. However, as we will see below, there is still room for improvement.
Canis et al. [Can+14] propose a similar approach called recurrence minimization. They
specifically tackle loop pipelining by incrementally restructuring dependence graphs to
minimize longest paths of recurrences. Their method is subsequently incorporated in
LegUp [LU]. For instance, by synthesizing the same original loop in LegUp, it detects
that there are inter-iteration dependences between each pair of sum from consecutive
iterations. The tool will therefore minimize the latency between these dependences by
using associativity to restructure the expression. This optimization produces a schedule
which corresponds to the following loop body:
sum = sum + ((A[i] + B[i]) + C[i]);
It is notable that, by further delaying the addition of sum, the loop now has only a delay
of 1 adder between consecutive iterations. This technique can greatly reduce the run
time of pipelined loops, especially if the inter-iteration dependences have a long chain of
additions. However, similar to VHLS, they only apply associativity in their restructuring
without regard for accuracy.
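The adder delays quoted for the three loop bodies can be checked with a small sketch. It assumes expressions are encoded as nested tuples and that the loop-carried variable occurs exactly once; the function name recurrence_delay is hypothetical and does not come from VHLS or LegUp.

```python
def recurrence_delay(expr, var="sum"):
    """Count the adders on the path from `var` to the root of `expr`.

    Returns None when `var` does not occur in `expr`.
    """
    if isinstance(expr, str):
        return 0 if expr == var else None
    _op, lhs, rhs = expr
    for child in (lhs, rhs):
        depth = recurrence_delay(child, var)
        if depth is not None:
            return depth + 1   # one more adder between `var` and the root
    return None

# sum = ((sum + A[i]) + B[i]) + C[i]
original = ("+", ("+", ("+", "sum", "A[i]"), "B[i]"), "C[i]")
# sum = (sum + A[i]) + (B[i] + C[i])
balanced = ("+", ("+", "sum", "A[i]"), ("+", "B[i]", "C[i]"))
# sum = sum + ((A[i] + B[i]) + C[i])
minimized = ("+", "sum", ("+", ("+", "A[i]", "B[i]"), "C[i]"))

delays = [recurrence_delay(e) for e in (original, balanced, minimized)]
print(delays)  # [3, 2, 1]
```

All three bodies use three adders in total; only the position of sum in the tree, and hence the length of the recurrence, changes.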
Polynomial Factorization
The above mentioned tools restrict themselves to simple arithmetic equivalences, as they
are intended for fast and simple optimizations that can easily be applied by a compiler.
Several techniques take one step further, by focusing on multivariate polynomials, and
factorizing them to minimize the number of arithmetic operations in an expression. These
approaches are applicable to both software and hardware designs: because they minimize
the number of operations in an expression, the throughput of the optimized design can
be improved. In addition, in an FPGA circuit, this will also translate to a reduction of
resources utilized.
It is well known that the Horner scheme is the optimal way to factorize a univariate polynomial
so that its operator count is minimized [Neu01], e.g. a polynomial $x^3 + 2x^2 + 3x + 4$,
which uses 4 multipliers and 3 adders if common subexpressions are eliminated, can be
factorized into $x(x(x + 2) + 3) + 4$, which uses 2 fewer multipliers. However, a multivariate
polynomial could be expressed in multiple ways using the Horner scheme, and finding
the optimal one is a difficult problem. Ceberio et al. [CK04] therefore propose a greedy
algorithm to efficiently factorize a multivariate polynomial. Hosangadi et al. [Hos+04]
propose an algorithm for the factorization of polynomials, in order to
eliminate common subexpressions, and subsequently reduce addition and multiplication
counts for a faster software implementation. However, it is not possible to choose different
optimization levels with their method. Peymandoust et al. [PDM01] present an approach
that deals only with the factorization of polynomials in HLS using Gröbner bases. The
weaknesses of this approach are its dependence on a set of library expressions [Hos+04] and the
high computational complexity of Gröbner bases.
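The operator counts quoted for the univariate example can be reproduced with a short sketch. It assumes expressions are nested tuples and models common subexpression elimination by counting each distinct subtree once; the helper names are illustrative only.

```python
def unique_nodes(expr, nodes=None):
    """Collect the distinct operator nodes of a tuple-encoded expression."""
    nodes = set() if nodes is None else nodes
    if isinstance(expr, tuple):
        nodes.add(expr)              # identical subtrees are stored once
        for child in expr[1:]:
            unique_nodes(child, nodes)
    return nodes

def op_counts(expr):
    """Count adders and multipliers after sharing common subexpressions."""
    counts = {"+": 0, "*": 0}
    for node in unique_nodes(expr):
        counts[node[0]] += 1
    return counts

# x^3 + 2x^2 + 3x + 4, with x*x shared between x^3 and 2x^2
x2 = ("*", "x", "x")
x3 = ("*", x2, "x")
expanded = ("+", ("+", ("+", x3, ("*", 2, x2)), ("*", 3, "x")), 4)

# Horner form x(x(x + 2) + 3) + 4
horner = ("+", ("*", "x", ("+", ("*", "x", ("+", "x", 2)), 3)), 4)

print(op_counts(expanded))  # {'+': 3, '*': 4}
print(op_counts(horner))    # {'+': 3, '*': 2}
```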
Shortcomings
The above approaches utilize a number of heuristics to rewrite expressions, and do not
explore all possible rewrites in full by taking into account additional equivalence relations.
This could limit their applicability to a small number of special cases. The implementation
obtained is therefore likely to be suboptimal, as further improvements are
possible. More importantly, none of the above-mentioned techniques and tools aims to
minimize, or even analyze, the impact of their transformations on numerical accuracy.
HLS tools therefore generally disable these unsafe features by default for floating-point
computations. It is in general a good practice to avoid them, if numerical accuracy is a
critical factor to ensure the correctness of the application being compiled.
2.5.2 Rewriting Arithmetic Expressions for Accuracy
In many numerically sensitive programs, small round-off errors, when accumulated,
would result in catastrophically inaccurate results. In particular, Panchekha et al. [Pan+15]
show that inaccurate computations, due to round-off errors, resulted in the retraction of
scientific articles [AM99], and even wild discrepancies in stock market indices [BDM99].
In response, the software community has seen an emergence of techniques that rewrite
expressions in numerical programs, to ameliorate round-off errors in numerically sensitive
applications. However, round-off errors in numerical programs are well known to be
perplexing to debug [TM14], and it is in general difficult to apply intuition manually
and rewrite programs to optimize numerical accuracy [TM14]. The numerical accuracy
optimization techniques therefore often explore the search space of equivalent expressions,
using equivalence relations in real arithmetic, ranging from the simplest possible,
e.g. associativity, commutativity and distributivity, to more sophisticated ones, such as
trigonometric identities, equivalence rules that are known to sometimes improve accuracy,
e.g. $x - y = \frac{x^2 - y^2}{x + y}$ [Pan+15], and many more.
As the vastness of the search space prohibits exploring it fully, existing approaches
confronting this problem resort to heuristics.
Darulova et al. [Dar+13] use genetic programming to evolve the structure of arithmetic
expressions into more accurate forms. However, metaheuristics have several disadvantages:
convergence can only be demonstrated empirically, and scalability is
difficult to control because there is no definitive method to decide how long the algorithm
must run until it reaches a satisfactory goal.
The method proposed by Martel [Mar07] is based on operational semantics with abstract
interpretation, but even their depth limited strategy is, in practice, at least exponentially
complex.
Ioualalen et al. [IM12] create a polynomial-size structure, APEG, to represent an exponential
number of equivalent expressions related by rules of equivalence. However, it
restricts itself to only a handful of these rules to avoid combinatorial explosion of the
structure and offers no options for tuning its optimization level.
Panchekha et al. [Pan+15] present a tool, Herbie, which employs a greedy hill-climbing
heuristic to iteratively rewrite expressions in locations that introduce the largest round-off
errors. Here, detailed overviews are provided for two distinct approaches, APEG and
Herbie.
APEG
The APEG proposed by Ioualalen et al. [IM12] was inspired by the E-PEG, originally introduced
by Tate et al. [Tat+]*. E-PEGs were originally intended to discover equivalent structures
in programs rather than in arithmetic expressions; nevertheless, they include equivalence rules that
are similar to the ones that exist in real arithmetic. For instance, distributivity can be
applied over nodes such as θ and eval nodes.
The APEG is similar in construction to the E-PEG, which is also a graph-based structure,
and promotes the maximal sharing of common subtrees. The major difference is that
because of the lack of control-flow in the program, APEGs are acyclic, and do not consist
of nodes that correspond to control-flows. APEGs subsume arithmetic expressions by
allowing all possible nodes in an arithmetic expression, i.e. it can contain leaf nodes
such as a constant value or a variable identifier, unary or binary arithmetic operators
that respectively have one or two child nodes. In addition, to allow an APEG to encode
an exponential number of equivalent expressions within a polynomial space, it further
introduces two different kinds of nodes that efficiently model equivalences.
Firstly, the APEG defines a node containing an abstraction box, $\boxed{\otimes, (p_1, p_2, \ldots, p_n)}$, where
$\otimes$ is a commutative associative operator such as addition $+$ or multiplication $\times$, and
$p_1, p_2, \ldots, p_n$ are the children of this node. The abstraction box can be used to represent
equivalent expressions generated by different associative parsings of the expression
$p_1 \otimes p_2 \otimes \ldots \otimes p_n$. To illustrate, an expression $a + b + c$ can be encoded by the abstraction
box $\boxed{+, (a, b, c)}$, which can represent all equivalent parsings of the original expression,
i.e. $(a + b) + c$, $a + (b + c)$ and $(a + c) + b$. Each abstraction box with $n$ children can
represent up to $(2n - 3)!!$ equivalent expressions without commutativity [IM12; Mou11],
as commutativity does not impact numerical accuracy for commutative operators.
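The count $(2n - 3)!!$ can be checked by brute force for small $n$. The sketch below is not part of the APEG implementation; it assumes distinct terms and enumerates every binary tree over them up to commutativity, by always keeping the first term in the left subtree.

```python
from itertools import combinations

def parsings(terms):
    """All binary trees over distinct `terms`, identified up to commutativity."""
    terms = list(terms)
    if len(terms) == 1:
        return [terms[0]]
    first, rest = terms[0], terms[1:]
    trees = []
    # The left subtree always contains `first`; the right one is non-empty.
    for size in range(len(rest)):
        for chosen in combinations(range(len(rest)), size):
            left = [first] + [rest[i] for i in chosen]
            right = [rest[i] for i in range(len(rest)) if i not in chosen]
            for lhs in parsings(left):
                for rhs in parsings(right):
                    trees.append((lhs, rhs))
    return trees

def double_factorial(m):
    return 1 if m <= 0 else m * double_factorial(m - 2)

for n in range(1, 6):
    assert len(parsings("abcde"[:n])) == double_factorial(2 * n - 3)

print(parsings("abc"))  # [('a', ('b', 'c')), (('a', 'b'), 'c'), (('a', 'c'), 'b')]
```

For five terms the box already stands for 105 parsings, which illustrates why the abstraction box is stored compactly instead of being expanded.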
*E-PEG and equality saturation are discussed in Section 2.4.2.
Secondly, in an analogous fashion to the E-PEG, the APEG additionally admits a new
kind of node which can be used to enclose a set of equivalent subexpressions. Defined as
〈p1, p2, . . . , pn〉, the equivalent class is a node which signifies all children subtrees with
root nodes p1, p2, . . . , pn represent expressions that are mathematically equivalent. This
is analogous to forming equivalence edges in E-PEGs.











Figure 2.13. An example APEG for the expression ((a + a) + b)× c, from [Mar12].
Take Figure 2.13 as our example, which not only contains standard tree nodes which
we would expect from an arithmetic expression tree, but also an abstraction box and
equivalent classes. Each equivalent class is represented by an ellipse with a dashed border.
It is evident that each equivalent class indeed reflects a set of equivalent subtrees. For
instance, consider the bottom equivalent class which has two equivalent subtrees, the left
and right ones represent $2 \times a$ and $a + a$ respectively. In addition, as the subexpression
$(a + a) + b$ is a summation of multiple elements, an abstraction box $\boxed{+, (a, a, b)}$ can
therefore model the equivalent parsings of it.
Despite the compactness of APEGs, searching for the most accurate expression within
an APEG is still exponentially complex, where the search utilizes the static analysis of
floating-point errors introduced in Section 2.3. For instance, consider an APEG with
$n$ equivalent classes, where each contains 2 different choices. In the worst case, this
may amount to a search of $2^n$ distinct equivalent expressions to determine the optimal
solution. To resolve this problem, Ioualalen et al. [IM12] employ a depth-limited search
heuristic to reduce the size of the search space. Additionally, as explained earlier, the
time complexity to search for the optimal parsing of an abstraction box is double factorial.
They hence make use of another heuristic, which greedily pairs up terms in an abstraction
box $B = \boxed{+, (p_1, p_2, \ldots, p_n)}$. To begin, the method searches for a pair of child subtrees $p_i$
and $p_j$ in $B$, such that the expression $p_i \otimes p_j$ computes with the smallest round-off error.
The subtrees $p_i$ and $p_j$ are removed from $B$, and the new node $p_i \otimes p_j$ is
then added to $B$. The above greedy pairing procedure is then iteratively repeated until
finally one node is left in $B$; this last node is therefore the expression with an improved
numerical accuracy obtained in the process.
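The greedy pairing loop can be sketched as follows. This is a simplified stand-in and not the procedure of [IM12]: it assumes each term carries a concrete double-precision value and measures only the exact rounding error of a single addition, whereas the original method relies on a static analysis of error bounds. The names rounding_error and greedy_pairing are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def rounding_error(x, y):
    """Exact absolute error committed by the floating-point addition x + y."""
    return abs(Fraction(x + y) - (Fraction(x) + Fraction(y)))

def greedy_pairing(terms):
    """Repeatedly merge the pair of box entries whose sum rounds best.

    `terms` is a list of (expression, value) pairs; the result is the
    single (expression tree, value) pair left in the box.
    """
    box = list(terms)
    while len(box) > 1:
        i, j = min(
            combinations(range(len(box)), 2),
            key=lambda ij: rounding_error(box[ij[0]][1], box[ij[1]][1]),
        )
        (ei, vi), (ej, vj) = box[i], box[j]
        box = [t for k, t in enumerate(box) if k not in (i, j)]
        box.append(((ei, ej), vi + vj))   # the merged node re-enters the box
    return box[0]

tree, value = greedy_pairing([("a", 1e16), ("b", 1.0), ("c", 1.0)])
print(tree, value)            # ('a', ('b', 'c')) 1.0000000000000002e+16
print((1e16 + 1.0) + 1.0)     # 1e+16: the left-to-right parsing loses b and c
```

Each round examines a quadratic number of pairs, so the whole loop is polynomial in the number of children, in contrast to the double factorial cost of exhaustive search.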
Ioualalen et al. [IM12] found that their APEG approach can reduce round-off errors in expressions
by up to 50%. They further proved the correctness of their APEG approach by
forming a Galois connection:
\[ \wp(\llparenthesis e \rrparenthesis_B) \overset{\alpha}{\underset{\gamma}{\rightleftarrows}} \Pi_B, \tag{2.55} \]
where $\wp(\llparenthesis e \rrparenthesis_B)$ is the set of all possible traces of rewrites of an initial expression $e$ using associativity,
distributivity and commutativity, and $\Pi_B$ is the final saturated APEG.
Herbie
Unlike the APEG approach, which focuses on building an IR that can represent exponentially
many equivalent expressions, Herbie, introduced by Panchekha et al. [Pan+15],
places emphasis on building an algorithm to efficiently improve numerical
accuracy by arithmetic rewrite rules.
Initially, for a given expression $e$, Herbie samples a randomized set of input values used
to evaluate it, and we use $I$ to denote this set. In the optimization process, Herbie
analyzes the discrepancy, due to round-off errors, between evaluating an expression exactly and in a given
floating-point format with precision $\hat{p}$, for a common input $i \in I$.
However, because expressions contain irrational computations such as square roots and
transcendental functions, e.g. trigonometric functions, logarithms, etc., exact computation is generally
unattainable as it could require infinite precision in floating-point. Instead, Herbie resorts
to the GNU Multiple Precision Floating-Point Reliable Library (MPFR), and iteratively increases
the precision used to compute the approximate results for $e$, until the leading 64 bits
in the result remain identical across iterations for all randomly sampled inputs $i \in I$.
Evaluating $e$ in the final precision $p$ therefore produces an almost exact (AE)‡ result for $e$.
It is notable that in formulae, a hat over a term, as in $\hat{x}$, is used to denote an approximation of
$x$, whereas the absence of it indicates an AE result.
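The idea of raising the working precision until the leading part of the result stabilizes can be sketched with Python's standard decimal module. This is only an analogue of the procedure described above: it compares leading decimal digits and does not use MPFR or the 64-bit criterion, and the function name almost_exact and its parameters are assumptions made for illustration.

```python
from decimal import Context, Decimal, localcontext

def almost_exact(f, inputs, digits=20, start_prec=30, max_prec=4000):
    """Double the working precision until the leading `digits` significant
    digits of f(i) stop changing for every sampled input i."""
    leading = Context(prec=digits)   # used only to round for comparison
    previous, prec = None, start_prec
    while prec <= max_prec:
        with localcontext() as ctx:
            ctx.prec = prec
            current = [leading.plus(f(Decimal(i))) for i in inputs]
        if current == previous:
            return current, prec
        previous, prec = current, 2 * prec
    raise ArithmeticError("results did not stabilize")

# sqrt(x + 1) - sqrt(x) cancels heavily for large x.
values, prec = almost_exact(lambda x: (x + 1).sqrt() - x.sqrt(), [1e15])
print(values[0], prec)
```

At the starting precision the subtraction cancels most of the computed digits, so the first two iterations disagree and the loop continues to a higher precision before accepting the result.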
An iterative accuracy refinement procedure is then carried out to optimize an initial
expression e0. This process identifies operators in the expression e0 which contribute the
largest round-off errors, and specifically applies arithmetic rewriting to these operators,
and repeats the process for the resulting expressions. This procedure is carried out for a
predetermined number N of repeats (Herbie uses N = 3) as follows.
To begin, for each operator in an original expression $e$, we accumulate all round-off errors
for all sample inputs $i \in I$. For instance, consider a binary operator $+$ which accepts
two input arguments $e_1$ and $e_2$, where each argument is a subexpression. To guarantee
the correctness of the round-off error analysis of $+$, its subexpressions are evaluated
into respective AE values $v_1$ and $v_2$ for a given input $i$. Using $v_1$ and $v_2$ as inputs, the
operator $+$ is then evaluated twice, in precision $\hat{p}$ and in AE precision $p$, to produce $\hat{v}_i$ and
$v_i$ respectively, i.e. $\hat{v}_i$ and $v_i$ are respectively the approximate and AE results of evaluating the
addition with inputs $v_1$ and $v_2$. The error between $x = \hat{v}_i$ and $y = v_i$ is
defined as follows [Pan+15; Sch+14], which counts the number of floating-point values
between the approximate and exact results§:
\[ E(x, y) = \log_2 \left| \{ z \in \mathbb{F} \mid \min(x, y) \le z \le \max(x, y) \} \right|. \tag{2.56} \]
The error $E(\hat{v}_i, v_i)$ is then accumulated for all sample inputs $i \in I$, and the final sum is the total





The top M operators with the largest errors are then selected to be rewritten in the next
stage. Panchekha et al. use M = 4 in [Pan+15].
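The error measure in (2.56) can be computed without enumerating floating-point values, because the bit patterns of IEEE 754 numbers are ordered like integers. The sketch below assumes that $\mathbb{F}$ is the set of finite double-precision values, a choice not fixed by the original definition, and it counts $-0.0$ and $0.0$ as one value.

```python
import math
import struct

def ordinal(x):
    """Map a finite double to an integer so that adjacent doubles differ by 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles are stored as sign and magnitude; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def bits_of_error(x, y):
    """log2 of the number of doubles z with min(x, y) <= z <= max(x, y)."""
    return math.log2(abs(ordinal(x) - ordinal(y)) + 1)

print(bits_of_error(1.0, 1.0))              # 0.0, identical values
print(bits_of_error(1.0, 1.0 + 2.0 ** -52)) # 1.0, adjacent values
print(bits_of_error(1.0, 2.0))              # about 52, one full binade apart
```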
‡The original paper [Pan+15] uses “exact” to describe what is in fact a floating-point value with a precision
p, which approximates the exact value. Because this could potentially mislead, “almost exact” is used
here instead.
§F is the set of floating-point values in a precision which was not mentioned by the original author [Pan+15].
The second step is to rewrite the arithmetic expression, using the information gathered in
the previous step about the largest local errors in the expression. The authors of Herbie
identify that if subexpressions of the selected operators are rewritten, transforming the
selected operators’ expression tree can provide greater chances of cancelling terms, hence
significantly reducing the round-off errors. The rewriting process therefore performs a recursive
bottom-up transformation, which first rewrites the smaller subexpressions, then
applies transformation rules to the selected operators. In contrast to APEGs, Herbie does
not restrict its database of rewrites to the basic arithmetic rules such as associativity and
distributivity, as it also admits expressions with fractions, square roots, exponentiation,
logarithms and trigonometric functions; additional rules thus must be provided for their
corresponding algebraic identities. In addition, Herbie has rules for special polynomial
patterns. For instance, it may rewrite $a^3 + b^3$ into $(a + b)(a^2 - ab + b^2)$.
The third step is to further simplify the expression generated from the rewriting step. In
this step, a set of simplification rules is applied to the expression for a small number of
iterations, which transforms the original into a new expression. However, unlike step 2,
as new expressions are discovered, an equivalence graph is gradually constructed. An
expression with the smallest tree structure can then be extracted from the graph as the
simplified result.
Herbie further identifies that some of these expressions, when evaluated, could produce
highly inaccurate outcomes near zero or infinity. In the fourth step, it will then substitute
the expression with a series expansion to approximate the original, which may no longer
suffer from this problem. This series expansion can be further restructured in future
iterations of refinement.
As the above process produces multiple candidate expressions, in the next iteration of
the accuracy refinement procedure, the four steps above are applied to all candidate
expressions, and these steps will be repeated $N$ times in total, generating a set of accurate
expressions that are mathematically equivalent to the original. Finally, because different
candidate expressions may excel on different input conditions, a dynamic programming
algorithm is applied to split the input space into multiple parts, where each
part can be computed by different optimized expressions. As an example, consider the




Their technique can optimize this expression to generate a scheme of expressions to be


















if $10^{127} < b$.
(2.59)
Note that the third alternative, $-\frac{b}{a} + \frac{c}{b}$, is not equivalent to the other two in real arithmetic,
because series expansion is applied to approximate the square root.
Panchekha et al. [Pan+15] discovered that their tool can improve the numerical accuracy
of a suite of benchmark examples from [Ham86], by recovering up to 60 bits of floating-
point precision, with a performance overhead of 40% for the generated expressions.
2.5.3 Numerical Programs
The techniques we have explored so far have only been limited to individual arithmetic
expressions; for a complete numerical program transformation, not only is it necessary
to support sequential execution of straight-line code, but also control-flow structures
such as conditional branches and loops. However, the techniques in this research area to
optimize numerical accuracy are currently not mature enough to be used in compilers
and HLS tools.
Tate et al. [Tat+], with their E-PEGs, enable equivalent programs to be discovered efficiently.
However, their method does not consider the implications of rewriting arithmetic
expressions in the data-flow on numerical accuracy.
Damouche et al. [Dam+15] developed a new semantics-based method, built on top of
the foundation of APEGs from Ioualalen et al. [IM12], for the purpose of accuracy optimization
in general numerical programs. Their method formally defines a set of inference
rules for rewriting a sequent, a formula used to encode the original program, into another
one representing an optimized program. To summarize from an informal standpoint,
the inference rules translate the basic blocks in the numerical program, which consist of
arithmetic operations and variable assignments, into standard arithmetic expressions.
These arithmetic expressions can then be analyzed and optimized by constructing APEGs
and extracting an optimized expression from the APEGs. Because they only focus on
the basic blocks of the program, containing only data-flows, their approach is not able
to optimize across control-flow boundaries such as if statements and while loops.
Their method can improve numerical accuracy by approximately 20% for simple loop
examples.
To summarize, the lack of the ability to trade off multiple performance objectives in
program optimization provides a strong motivation for the work in this thesis. We believe
that rewriting a numerical program could not only improve its numerical accuracy, but
also at the same time increase its throughput while optimizing for resource utilization. In
addition, as mentioned earlier, all of the above program optimization methods have their
shortcomings; they thus form the basis for this thesis to develop an optimization framework
which avoids these disadvantages. Similar to these program optimization techniques, this
thesis presents a method which builds on top of the formal semantics of programs. This
ensures the correctness of program optimization.
3 Structural Optimization of Arithmetic Expressions
By exploiting rules of equivalence in arithmetic, such as associativity (a + b) + c ≡
a+ (b+ c) and distributivity (a+ b)× c ≡ a× c+ b× c, it is possible to automatically
generate different implementations of the same arithmetic expression. We optimize the
structures of arithmetic expressions in terms of the following two quality metrics relevant
to FPGA implementation: the resource usage when synthesized into circuits, and a bound
on roundoff errors when evaluated. Our goal is the joint minimization of these two quality
metrics. This optimization process provides a Pareto optimal set of implementations. For
example, the tool discovers that with single-precision floating-point representation, if
$a \in [0.1, 0.2]$, then the expression $(a + 1)^2$ uses fewest resources when implemented in
the form $(a + 1) \times (a + 1)$ but is most accurate when expanded into $(((a \times a) + a) + a) + 1$.
However, it turns out that a third alternative, $((1 + a) + a) + (a \times a)$, is never desirable
because it is neither more accurate nor uses fewer resources than the other two possible
structures. Our aim is to automatically detect and utilize such information to optimize
the structure of expressions.
A naïve implementation of equivalent expression finding would be to explore all possible
equivalent expressions to find optimal choices. However, this would result in combinatorial
explosion [IM12]. Since none of the techniques explained in Section 2.5 of
Chapter 2 captures the optimization of both accuracy and performance by restructuring
arithmetic expressions, we build on the software work of Martel [Mar07], but
extend it in the following ways. Firstly, we develop new hardware-appropriate
semantics to analyze not only accuracy but also resource usage, seamlessly taking into
account common subexpression elimination. Secondly, because we consider both resource
usage and accuracy, we develop a novel multi-objective optimization approach to scalably
construct the Pareto frontier in a hierarchical manner, allowing fast design exploration.
Thirdly, equivalence finding is guided by prior knowledge on the bounds of the expression
variables, as well as local Pareto frontiers of subexpressions while it is optimizing expres-
sion trees in a bottom-up approach, which allows us to reduce the complexity of finding
equivalent expressions without sacrificing our ability to optimize expressions. Following
Martel [Mar07] and Ioualalen et al. [IM12], the methodology explained in this chapter
makes use of formal semantics as well as abstract interpretation [CC77] to significantly
reduce the space and time requirements and produce a subset of the Pareto frontier.
In order to further increase the options available in the Pareto frontier, we introduce
freedom in choosing mantissa widths for the evaluation of the expressions. Generally
as the precision of the evaluation increases, the utilization of resources increases for
the same expression. This gives flexibility in the trade-off between resource usage and
precision. Our approach and its associated tool, SOAP, allow high-level synthesis flows to
automatically determine whether it is a better choice to rewrite an expression, or change
its precision in order to meet optimization goals.
The three contributions of this chapter are:
1. Efficient methods for discovering equivalent structures of arithmetic expressions.
2. A semantics-based program analysis that allows joint reasoning about the resource
usage and safe ranges of values and errors in floating-point computation of arith-
metic expressions.
3. A tool which produces RTL implementations on the area-accuracy trade-off curve
derived from structural optimization.
This chapter is structured as follows. Sections 3.1 and 3.2 extend the basic concepts of
semantics with abstract interpretation explained in Section 2.3 to respectively analyze
accuracy and resources. Section 3.3 discusses various abstract semantics for finding
equivalent structure in arithmetic expressions, as well as the analysis of their resource
usage estimates and bounds of errors. Section 3.4 gives an overview of the implementa-
tion details in SOAP. Finally, we discuss the results of optimized example expressions in
Section 3.5 and end with concluding remarks for this chapter in Section 3.6.
3.1 Accuracy Analysis
The accuracy analysis used by SOAP follows the method based on the abstract error domain
introduced by Martel [Mar07] to analyze the round-off error of restructured floating-point
expressions. As mentioned in Section 2.3.4, Martel did not have a preference for the
choice of definition of ulp. Hence, here we propose to use a definition of ulp derived from
the standard IEEE 754 floating-point representation.
To begin, the concepts of the floating-point representation [ANS08] should be introduced.
Any value $v$ representable in floating-point with standard exponent offset can be
expressed with the format given by the following equation:
\[ v = s \times 2^{e - (2^{k-1} - 1)} \times 1.m_1 m_2 m_3 \ldots m_p. \tag{3.1} \]
In (3.1), the bit $s$ is the sign bit, the $k$-bit unsigned integer $e$ is known as the exponent
bits, and the $p$ bits $m_1 m_2 m_3 \ldots m_p$ are the mantissa bits. Here we use $1.m_1 m_2 m_3 \ldots m_p$
to indicate a fixed-point number represented in unsigned binary format.
Note that a non-zero floating-point value with the smallest magnitude can be found by setting
$e = 0$ and $m_1, m_2, \ldots, m_p$ to 0, which may still be far away from 0 when compared to
the next non-zero floating-point value, where $e = 0$, $m_1, m_2, \ldots, m_{p-1}$ are all 0, and
$m_p = 1$. To solve this discontinuous behaviour, extra logic can be used to allow a different
floating-point representation when $e = 0$:
\[ v = s \times 2^{-(2^{k-1} - 1)} \times 0.m_1 m_2 m_3 \ldots m_p. \tag{3.2} \]
This alternative behaviour is known as gradual underflow, whereas the original is abrupt
underflow.
In SOAP, the distance between two adjacent floating-point values $f_1$ and $f_2$ satisfying
$f_1 \le x \le f_2$ for a value $x$ [Gol91] is known as the unit of the last place function $\mathrm{ulp}(x)$.
To characterize this function, we further require that $f_1$ and $f_2$ must not be equal and that no
other floating-point value exists between them. We can now provide the definition for
$\mathrm{ulp}(x)$, where GU and AU respectively indicate gradual and abrupt underflow modes:
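Definition 3.1. In our analysis, the function $\mathrm{ulp}(x)$ is defined as:
\[
\mathrm{ulp}(x) =
\begin{cases}
\infty, & \text{if } x \text{ is } -\infty \text{ or } \infty, \\
2^{e(x) - (2^{k-1} - 1)} \times 2^{-p}, & \text{if GU and } x \text{ is not } -\infty \text{ or } \infty, \\
\max\left( 2^{-(2^{k-1} - 1)},\; 2^{e(x) - (2^{k-1} - 1)} \times 2^{-p} \right), & \text{if AU and } x \text{ is not } -\infty \text{ or } \infty,
\end{cases}
\tag{3.3}
\]
where $e(x)$ is the exponent of $x$, and $k$ and $p$ are the parameters of the floating-point format as
defined in (3.1).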
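Definition 3.1 translates almost line by line into code. The following sketch is not taken from SOAP; it assumes that $e(x)$ is the biased exponent field, obtained here from the magnitude of a Python float and clamped at 0, and it returns exact rationals so that very small results do not underflow. The helper name biased_exponent is hypothetical.

```python
import math
from fractions import Fraction

def biased_exponent(x, k):
    """Biased exponent e(x) of x in a format with k exponent bits (assumed model)."""
    if x == 0:
        return 0
    bias = 2 ** (k - 1) - 1
    _, exp = math.frexp(x)          # |x| = m * 2**exp with 0.5 <= m < 1
    return max(0, exp - 1 + bias)   # values below the normal range use e = 0

def ulp(x, k=8, p=23, gradual=True):
    """ulp(x) following (3.3); k and p default to single precision."""
    if math.isinf(x):
        return math.inf
    bias = 2 ** (k - 1) - 1
    spacing = Fraction(2) ** (biased_exponent(x, k) - bias - p)
    if gradual:                                   # GU case
        return spacing
    return max(Fraction(2) ** -bias, spacing)     # AU case

print(ulp(1.0))                  # 1/8388608, i.e. 2**-23
print(ulp(1.0, k=11, p=52))      # 2**-52 for double precision
print(ulp(0.0, gradual=False))   # 2**-127, the smallest non-zero magnitude in AU
```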
Since Martel [Mar07] does not define the arithmetic operator for division, the following
equations for division are introduced in this thesis. Firstly, divisions on intervals
can be implemented as follows:
\[ \frac{[a, b]}{[c, d]} := [\min(s), \max(s)], \tag{3.4} \]
where [a, b] , [c, d] ∈ Interval and:
s =




























and the round-off error introduced



































computes the range of floating-point









returns the range of errors introduced in
the rounding process.
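The interval part of the division in (3.4) can be sketched as below. The sketch assumes, as in standard interval arithmetic, that $s$ collects the four quotients of the interval endpoints and that the divisor interval does not contain zero; the treatment of round-off errors is not modelled. Exact rationals are used so that the bounds themselves are not rounded.

```python
from fractions import Fraction

def interval_div(num, den):
    """Divide interval num = (a, b) by den = (c, d), assuming 0 is not in [c, d]."""
    a, b = num
    c, d = den
    if c <= 0 <= d:
        raise ZeroDivisionError("divisor interval contains zero")
    # Candidate bounds: every endpoint of num over every endpoint of den.
    s = [Fraction(x) / Fraction(y) for x in (a, b) for y in (c, d)]
    return min(s), max(s)

print(interval_div((1, 2), (4, 8)))     # (Fraction(1, 8), Fraction(1, 2))
print(interval_div((-1, 2), (-4, -2)))  # (Fraction(-1, 1), Fraction(1, 2))
```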
We use the function $\mathrm{Error} : \mathrm{AExpr} \to E^\sharp$ to represent the analysis of round-off error
in an expression tree, as described in Section 2.3.4 of Chapter 2, using the above ulp
equation in Definition 3.1, where $\mathrm{AExpr}$ denotes the set of all arithmetic expressions.
Each expression $e$ in a set of discovered equivalent expressions
evaluates to a distinct value in the abstract error domain. Section 2.3.4 of Chapter 2
presents a method to compare these values against each other with a partial ordering. However, a
total ordering is preferable, as all expressions can be easily compared against
one another. In SOAP, the following function AbsError is used to convert an evaluated
outcome $v \in E^\sharp$ into a scalar to denote the magnitude of round-off error:
\[ \mathrm{AbsError}(e) = \max\left( \left| \mu^\sharp_{\min} \right|, \left| \mu^\sharp_{\max} \right| \right), \quad \text{where } (x^\sharp, [\mu^\sharp_{\min}, \mu^\sharp_{\max}]) = \mathrm{Error}(e). \tag{3.7} \]
3.2 Resource Usage Analysis
Here we define similar formal semantics which calculate an approximation to the FPGA
resource usage of an expression, taking into account common subexpression elimination.
This is important as, for example, rewriting $a \times b + a \times c$ as $a \times (b + c)$ in the larger
expression $(a \times b + a \times c) + (a \times b)^2$ causes the common subexpression $a \times b$ to be no
longer present in both terms. Our analysis must capture this.
The analysis proceeds by labelling subexpressions. Intuitively, the set of labels Label,
is used to assign unique labels to unique expressions, so it is possible to easily identify
and reuse them. For convenience, let the function fresh : AExpr → Label assign a
distinct label to each expression or variable, where AExpr is the set of all expressions. It
is noteworthy that fresh is a bijection. Before we introduce the labeling semantics, we
define the environment λ : Label→ AExpr ∪ {⊥}, which is a function that maps labels
to expressions, and Env denotes the set of such environments. A label l in the domain
of λ ∈ Env that maps to ⊥ indicates that l does not map to an expression. An element
(l, λ) ∈ Label×Env stands for the labeling scheme of an expression. Initially, we map
all labels to $\bot$, then in the mapping $\lambda$, each leaf of an expression is assigned a unique
label, and the unique label $l$ is used to identify the leaf. That is, for the leaf variable or
constant $x$:
\[ (l, \lambda) = (\mathrm{fresh}(x), [\mathrm{fresh}(x) \mapsto x]). \tag{3.8} \]
This equation uses $[\mathrm{fresh}(x) \mapsto x]$ to indicate an environment that maps the label $\mathrm{fresh}(x)$
to the expression $x$ and all other labels to $\bot$; in other words, if $l = \mathrm{fresh}(x)$ and
$l' \ne l$, then $\lambda(l) = x$ and $\lambda(l') = \bot$.
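For example, consider the expression $(a + b)^2 = (a + b) \times (a + b)$. We initially have for
the variables $a$ and $b$:
\[
\begin{aligned}
(l_a, \lambda_a) &= (\mathrm{fresh}(a), [\mathrm{fresh}(a) \mapsto a]) = (l_1, [l_1 \mapsto a]), \\
(l_b, \lambda_b) &= (l_2, [l_2 \mapsto b]).
\end{aligned}
\tag{3.9}
\]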
Then the environments are propagated in the flow direction of the DFG, using the
following formulation of the labeling semantics:
$$(l_x, \lambda_x) \otimes (l_y, \lambda_y) = (l, (\lambda_x \uplus \lambda_y)[l \mapsto l_x \otimes l_y]), \quad \text{where } l = \mathrm{fresh}(l_x \otimes l_y),\ \otimes \in \{+, -, \times\}. \tag{3.10}$$
Specifically, λ = λx ⊎ λy signifies that λy is used to update the mapping in λx, if the
mapping does not exist in λx, and results in a new environment λ; and λ[l ↦ x] is a
shorthand for λ ⊎ [l ↦ x]. As an example, using (3.9), and recalling that l1 = la and l2 = lb,
we derive for the subexpression a + b:
$$\begin{aligned}
(l_{a+b}, \lambda_{a+b}) &= (l_a, \lambda_a) + (l_b, \lambda_b) \\
&= (l_3, (\lambda_a \uplus \lambda_b)[l_3 \mapsto l_a + l_b]), \quad \text{where } l_3 = \mathrm{fresh}(l_a + l_b) \\
&= (l_3, [l_1 \mapsto a] \uplus [l_2 \mapsto b] \uplus [l_3 \mapsto l_1 + l_2]) \\
&= (l_3, [l_1 \mapsto a, l_2 \mapsto b, l_3 \mapsto l_1 + l_2]),
\end{aligned} \tag{3.11}$$
where la + lb is a syntactic construct to signify that the subexpressions with labels la and
lb are added to form an expression. Finally, for the full expression (a + b) × (a + b):
$$\begin{aligned}
(l, \lambda) &= (l_{a+b}, \lambda_{a+b}) \times (l_{a+b}, \lambda_{a+b}) \\
&= (l_4, [l_1 \mapsto a, l_2 \mapsto b, l_3 \mapsto l_1 + l_2, l_4 \mapsto l_3 \times l_3]).
\end{aligned} \tag{3.12}$$
From the above derivation, it is clear that the semantics capture the reuse of subexpressions.
The estimation of area is performed by counting, for an expression, the numbers
of additions, subtractions and multiplications in the final labeling environment, then
calculating an approximation to the number of LUTs used to synthesize the expression by
adding the LUT requirement of each operator. If the number of operators is n⊗, where
⊗ ∈ AOp and AOp denotes the set of arithmetic operators, then the number of LUTs in
total is estimated by:
$$\sum_{\otimes \in \mathrm{AOp}} R^{\mathrm{LUT}}_{\otimes} \, n_{\otimes}, \tag{3.13}$$
where the value R^LUT_⊗ denotes the number of LUTs per ⊗ operator, which is dependent
on the type of the operator and the floating-point format used to generate the operator.
Despite the simplicity of the resource estimation model, it is observed that this model can
accurately predict the LUT count of the synthesized circuits, as the resource reduction
from cross-module optimizations is negligible for floating-point data-paths. In later
chapters, we will further demonstrate how our approach can be extended to additionally
estimate DSP block counts.
In the following sections, we use the function Area : AExpr→ N to denote our resource
usage analysis.
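As a rough illustration, the labeling semantics and the LUT estimate can be sketched in Python as follows. The tuple encoding of expressions, the function names label and area, and the per-operator LUT costs are assumptions made for this sketch; they are not SOAP's actual implementation, and the costs are not measured values.

```python
from collections import Counter

def label(expr, env, labels):
    """Return the label of expr, recording it and its subexpressions in env.

    expr is a leaf (variable name or constant) or a tuple (op, left, right).
    env maps label -> leaf or (op, left_label, right_label); labels is the
    inverse map, so structurally identical subexpressions share one label.
    """
    if isinstance(expr, tuple):
        op, left, right = expr
        node = (op, label(left, env, labels), label(right, env, labels))
    else:
        node = expr
    if node not in labels:            # plays the role of `fresh`
        labels[node] = len(labels) + 1
        env[labels[node]] = node
    return labels[node]

# Hypothetical per-operator LUT costs, for illustration only.
LUTS_PER_OP = {'+': 400, '-': 400, '*': 600}

def area(expr, luts_per_op=LUTS_PER_OP):
    """Estimate LUTs by counting each distinct operator node exactly once."""
    env, labels = {}, {}
    label(expr, env, labels)
    counts = Counter(node[0] for node in env.values() if isinstance(node, tuple))
    return sum(n * luts_per_op[op] for op, n in counts.items())

squared = ('*', ('+', 'a', 'b'), ('+', 'a', 'b'))
print(area(squared))  # 1000: one shared adder plus one multiplier
```

Because the shared subexpression a + b receives a single label, it contributes only one adder to the estimate, mirroring the derivation in (3.12).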
3.3 Equivalent Expressions Analysis
In Section 2.3.4 of Chapter 2, we introduced semantics that define additions and multiplications
on intervals, then transitioned to error semantics that compute bounds of values and
errors. In Section 3.2, we introduced an alternative semantics, the labelling environments,
which eliminates common subexpressions by defining the meaning of arithmetic
operations on environments. In this section, we take the leap from analyzing an
expression for its quality to defining arithmetic operations on sets of equivalent
expressions, and use these rules to discover equivalent expressions. Before this, it is
necessary to formally define equivalent expressions and the
functions used to discover them.
3.3.1 Discovering Equivalent Expressions
From an expression, a set of equivalent expressions can be discovered by our equivalence
relation ≡ on the set of all arithmetic expressions AExpr, and ≡ is a strict subset of
AExpr×AExpr. It is noteworthy that a relation is said to be an equivalence relation
when it is reflexive, symmetric and transitive, i.e. for all e1, e2, e3 ∈ AExpr, we have the
following rules in our inference system:
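$$\text{[Reflexivity]}\ \ e_1 \equiv e_1, \qquad \text{[Symmetry]}\ \ \frac{e_1 \equiv e_2}{e_2 \equiv e_1}, \qquad \text{[Transitivity]}\ \ \frac{e_1 \equiv e_2 \wedge e_2 \equiv e_3}{e_1 \equiv e_3}. \tag{3.14}$$
Here, a rule of the form $\frac{a}{b}$ means that if the formula a is true, then b also holds.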
We extend our inference system with additional arithmetic rules that relate equivalent
expressions. Let e1, e2, e3 ∈ AExpr and v1, v2, v3 ∈ R. Firstly, we start with a set of
associativity rules:
[Assoc1] ∀⊗ ∈ {+,×} : (e1 ⊗ e2)⊗ e3 ≡ e1 ⊗ (e2 ⊗ e3),
[Assoc2] ∀⊗ ∈ {+,×, /} : (e1 ⊗ e2)⊗ e3 ≡ (e1 ⊗ e3)⊗ e2,
[Assoc3(/)] (e1/e2)/e3 ≡ e1/(e2 × e3),
[Assoc4(/)] e1/(e2/e3) ≡ e1 × e3/e2.
(3.15)
Secondly, we define commutativity rules for addition and multiplication:
[Commut] ∀⊗ ∈ {+,×} : e1 ⊗ e2 ≡ e2 ⊗ e1. (3.16)
Thirdly, distributivity rules enable further equivalent expressions to be explored with our
relation:
[Distrib1] ∀⊗ ∈ {×, /} : e1 ⊗ e3 + e2 ⊗ e3 ≡ (e1 + e2)⊗ e3,
[Distrib2(×)] e1 + e1 × e2 ≡ (1 + e2)× e1,
[Distrib3(×)] e1 + e1 ≡ 2× e1,
[Distrib4(/)] e1 + e2/e3 ≡ (e1 × e3 + e2)/e3,
[Distrib5(/)] e1/e3 + e2/e4 ≡ (e1 × e4 + e2 × e3)/(e3 × e4),
[Distrib6(−)] ∀⊗ ∈ {+,−} : −(e1 ⊗ e2) ≡ (−e1)⊗ (−e2),
[Distrib7(−)] ∀⊗ ∈ {×, /} : −(e1 ⊗ e2) ≡ (−e1)⊗ e2 ≡ e1 ⊗ (−e2).
(3.17)
The following rule rewrites an expression with subtraction into an addition:
[Subtract] e1 − e2 ≡ e1 + (−e2). (3.18)
The following reduction rules propagate constant values in expression trees. The
ConstProp1 rule states that if an expression is an arithmetic operation on two constant
values, then it can simply be evaluated to produce the result; and ConstProp2 rewrites
an expression that has a negation on a constant value into a single constant.
$$\text{[ConstProp1]}\ \ \frac{(v_3 = v_1 \otimes v_2) \wedge (\otimes \in \mathrm{BinAOp})}{v_1 \otimes v_2 \equiv v_3}, \qquad \text{[ConstProp2($-$)]}\ \ \frac{v_2 = -v_1}{-v_1 \equiv v_2}. \tag{3.19}$$
Here we define BinAOp = {+,−,×, /} to be the set of binary arithmetic operators.
We also define another set of reduction rules looking for common patterns that can be
simplified:
[Identity1] ∀⊗ ∈ {×, /} : e1 ⊗ 1 ≡ e1, [Elim1(−)] e1 − e1 ≡ 0,
[Identity2] ∀⊗ ∈ {+,−} : e1 ⊗ 0 ≡ e1, [Elim2(/)] e1/e1 ≡ 1,
[ZeroProp(×)] 0× e1 ≡ 0, [DoubleNeg(−)] − (−e1) ≡ e1.
(3.20)
Finally, the following two rules allow structural induction on expression trees, e.g. it is
possible to derive that a + (b + c) ≡ a + (c + b) from b + c ≡ c + b:
$$\text{[Tree1]}\ \ \frac{(e_1 \equiv e_2) \wedge (e_3 \equiv e_4) \wedge (\otimes \in \mathrm{BinAOp})}{e_1 \otimes e_3 \equiv e_2 \otimes e_4}, \qquad \text{[Tree2]}\ \ \frac{e_1 \equiv e_2}{-e_1 \equiv -e_2}. \tag{3.21}$$
We say that e1 is equivalent to e2 if and only if e1 ≡ e2, for some expressions e1 and e2.
Many rules are considered redundant and thus excluded from our equivalence relation,
as these can be derived by combining rules from our inference system. For instance,
e1 × (e2 + e3) ≡ e1 × e2 + e1 × e3 can be derived by using Distrib1, in tandem with the
Commut rule.
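To make the inference system more concrete, the following Python sketch applies a small subset of the rules (Commut, Assoc1 and Distrib1 for × only) as one-step rewrites. The tuple encoding and the names rewrites_at_root and step are assumptions of this sketch, which handles binary operators only and is not how SOAP itself represents the rules.

```python
def rewrites_at_root(e):
    """Yield expressions obtained by applying one rule at the root of e."""
    if not isinstance(e, tuple):
        return
    op, x, y = e
    if op in ('+', '*'):
        yield (op, y, x)                                   # Commut
        if isinstance(x, tuple) and x[0] == op:            # Assoc1, left to right
            yield (op, x[1], (op, x[2], y))
        if isinstance(y, tuple) and y[0] == op:            # Assoc1, right to left
            yield (op, (op, x, y[1]), y[2])
    if (op == '+' and isinstance(x, tuple) and isinstance(y, tuple)
            and x[0] == y[0] == '*' and x[2] == y[2]):     # Distrib1, factorize
        yield ('*', ('+', x[1], y[1]), x[2])
    if op == '*' and isinstance(x, tuple) and x[0] == '+':  # Distrib1, expand
        yield ('+', ('*', x[1], y), ('*', x[2], y))

def step(e):
    """All expressions reachable from e by one rule at any position.

    Rewriting inside one operand while the other is held fixed corresponds
    to the Tree1 rule combined with reflexivity on the untouched operand.
    """
    results = set(rewrites_at_root(e))
    if isinstance(e, tuple):
        op, x, y = e
        results |= {(op, x2, y) for x2 in step(x)}
        results |= {(op, x, y2) for y2 in step(y)}
    return results

print(step(('+', 'a', 'b')))  # {('+', 'b', 'a')}
```

Repeatedly applying step to its own results explores the equivalence class of an expression, which is what the subsequent sections make scalable.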
3.3.2 Scalable Methods for Rewriting
The above rules of equivalence relate an expression with all of its equivalent expressions.
In general, because of combinatorial explosion, the set of all equivalent expressions is too
large to be derived, which motivates us to develop scalable methods that execute fast
enough even with large expressions.
Instead of deriving the full set of equivalent expressions, we can define a new relation
⇝, a subset of ≡, which is identical to our equivalence relation ≡ except that we place
a few restrictions on the relation. This new relation can be generated by removing
the equivalence relation rules in (3.14). Firstly, reflexivity can be removed, because
it is not necessary to rediscover expressions. Secondly, we disable transitivity from ≡,
as we have the flexibility to apply ⇝ in a series of steps to generate equivalent
expressions. Finally, we further disallow symmetry in (3.14) for the reduction rules
in (3.19) and (3.20) to reduce the space-time complexity of the search, because
often performance metrics, such as accuracy and resource usage, improve when the
number of terms in the expression is reduced.
To make use of the new relation we define the following category of functions:
Definition 3.2. We call a function an equivalent expression generator (EEG) function
if and only if the function takes as an input an initial set of equivalent expressions, and
generates another set of expressions equivalent to those in the input set.
For instance, an EEG function I : ℘(AExpr≡) → ℘(AExpr≡), where ℘(AExpr≡)
denotes the power set of all equivalent expressions AExpr≡, can be defined as follows:
$$I(\epsilon) = \left\{ e' \in \mathrm{AExpr} \mid e \rightsquigarrow e' \wedge e \in \epsilon \right\}, \tag{3.22}$$
where ϵ is a set of equivalent expressions.
We define a functional:
$$\mathrm{cl}_N : (\wp(\mathrm{AExpr}_\equiv) \to \wp(\mathrm{AExpr}_\equiv)) \to (\wp(\mathrm{AExpr}_\equiv) \to \wp(\mathrm{AExpr}_\equiv)), \tag{3.23}$$
such that:
$$f'(\epsilon) = \bigcup_{i=0}^{N} f^i(\epsilon), \quad \text{where } f' = \mathrm{cl}_N(f). \tag{3.24}$$
Here, f and f′, respectively the input and output of clN, are both EEG functions. In
the rest of the section, we omit the brackets surrounding the input of clN for simplicity,
e.g. clN(f) can be written as clN f.
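As an example use of the functional clN, we may note that we can substitute f with I
in clN f(ϵ) to generate a set of equivalent expressions, by taking the union of N steps
of repeated application of I to ϵ. By further allowing N to approach ∞, we obtain the
full set of equivalent expressions of ϵ that can be discovered using our inference system:
$$\mathrm{cl}_\infty I(\epsilon) = \bigcup_{i=0}^{\infty} I^i(\epsilon). \tag{3.25}$$
More generally, for an EEG function f, the closure can be written as:
$$\mathrm{cl}_\infty f(\epsilon) = \bigcup_{i=0}^{\infty} g^i(\varnothing), \quad \text{where } g(\varepsilon) := f(\varepsilon) \cup \epsilon. \tag{3.26}$$
We may further omit the ∞ from cl∞ to denote the transitive closure, e.g. the above
example in (3.25) can be simplified to be cl I(ϵ).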
In practice, it is often infeasible to generate the full transitive closure of a given expression.
We therefore impose further constraints on how we discover equivalent expressions.
Firstly, instead of exploring the full transitive closure, that is, by allowing the number
of steps N in (3.24) to be infinite, we may restrict N to be a small finite value to allow
a smaller set of equivalent expressions to be computed. In later experiments, we have
chosen N = 10.
Secondly, the complexity of equivalent expression finding is reduced by fixing the structure
of subexpressions at a certain depth k in the original expression. The definition of depth
is given as follows: first the root of the parse tree of an expression is assigned depth
d = 1; then we recursively define the depth of a node as one more than the depth
of its greatest-depth parent. If the depth of the node is greater than k, then we fix
the structure of its child nodes by disallowing any equivalence transformation beyond
this node. We let Ik denote this “depth-limited” equivalence finding function, where
k is the depth limit used. We can then use clN Ik and clIk to denote the functions to
respectively compute the union of N steps of Ik and the transitive closure. This approach
is similar to Martel’s depth-limited equivalent expression transform [Mar07], however
Martel’s method eventually allows transformation of subexpressions beyond the depth
limit, because rules of equivalence would transform these to have a smaller depth. This
contributes to a time complexity at least exponential in terms of the expression size.
In contrast, our technique has a time complexity that does not depend on the size of
the input expression, but grows with respect to the depth limit k. Note that the size of the
full equivalence closure in (3.25), using the inference system we defined earlier, is at least
O((2n − 3)!!), where n is the number of terms in an expression, as we discussed earlier.
As the maximum number of terms in a binary tree with a depth k grows at a rate O(2^k),
the number of equivalent expressions that can be discovered is at least O((2 × 2^k − 3)!!)
with respect to k. In the production of experimental results, k is chosen to be either 2 or
3.
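The growth of this bound with k can be checked numerically. The short Python sketch below evaluates (2 × 2^k − 3)!! for small k; the function names are chosen for this illustration only, and n = 2^k is simply the growth rate quoted in the text rather than an exact term count.

```python
def double_factorial(m):
    """Product m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def closure_size_bound(k):
    """Evaluate (2 * 2**k - 3)!!, the bound quoted for depth limit k."""
    return double_factorial(2 * 2 ** k - 3)

for k in (2, 3, 4):
    print(k, closure_size_bound(k))
```

The value rises from 15 at k = 2 to 135135 at k = 3, and to roughly 6 × 10^15 at k = 4, which illustrates why only small depth limits are practical.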
Finally, we use an iterative algorithm to accelerate the computation of clN f(ϵ), where f
is a ∪-distributive EEG (see Definition 3.3) such as Ik. In each iteration, we keep track of
the equivalent expressions that are newly discovered in the current iteration, so that in
the next iteration we apply f only to those expressions, to avoid redundant computation.
This algorithm, shown in Figure 3.1, efficiently computes clN f(ϵ), where f can be Ik.
The correctness of this algorithm is discussed in greater depth in Appendix A.
Definition 3.3. We say an EEG function f is ∪-distributive if and only if the function
satisfies f(a ∪ b) = f(a) ∪ f(b).
The algorithms we have described so far do not incorporate analyses detailed in Sec-
tions 3.1 and 3.2, hence, they do not guide the optimization process with objectives
to minimize. The following section explains how the analyses can be used to steer the
algorithms to optimize the trade-off between accuracy and area in synthesized circuits of
transformed expressions.
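The iterative closure computation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: expressions are any hashable values, f maps a set to a set, and the early exit when nothing new is found is an addition of this sketch rather than something stated in the text.

```python
def closure(f, n, exprs):
    """Union of up to n steps of a ∪-distributive EEG function f."""
    seen = set(exprs)         # everything discovered so far
    new = set(exprs)          # expressions discovered in the latest iteration
    for _ in range(n):
        new = f(new) - seen   # apply f only to the newly found expressions
        if not new:           # assumed early exit: no further discoveries
            break
        seen |= new
    return seen

# Toy EEG function for illustration: swap the operands of each expression.
swap = lambda s: {(op, y, x) for (op, x, y) in s}
print(closure(swap, 10, {('+', 'a', 'b')}))
```

Because f distributes over ∪, applying it only to the newly discovered expressions loses nothing: the results for older expressions were already added in earlier iterations.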
function CLOSURE(f, N, ϵ)
    s_0 ← ϵ
    s′_0 ← ϵ
    for i ∈ {1, 2, . . . , N} do
        s′_i ← f(s′_{i−1}) / s_{i−1}
        s_i ← s_{i−1} ∪ s′_i
    end for
    return s_N
end function
Figure 3.1. Our algorithm to compute clN f(ϵ), which discovers a set of equivalent expressions
with a ∪-distributive EEG f from an initial set of equivalent expressions ϵ.
3.3.3 Pareto Frontier
Because we optimize expressions in two quality metrics, i.e. the accuracy of computation
and the estimate of FPGA resource utilization, there is a trade-off between them. We
desire the largest subset of all equivalent expressions E discovered such that in this
subset, no expression dominates any other expression, in terms of having both better
area and better accuracy. This subset is known as the Pareto frontier [Leg+10].
Figure 3.2 shows a simplified algorithm for calculating the Pareto frontier for a set
of equivalent expressions ϵ; faster algorithms exist. For instance, we could sort these
expressions by accuracy, as a fast sorting algorithm takes O(n log(n)) for n elements, and
then prune suboptimal ones in one pass.
Here, frontier/{e} is a set identical to frontier , except that the element e is removed.
We use the function AbsError to analyze the magnitudes of error bounds as defined
in (3.7).
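The sort-then-prune variant mentioned above can be sketched in Python as follows. The triple layout (error, area, expr) and the function name are assumptions of this sketch, and it uses weak dominance (a candidate is dropped if another is at least as good in both metrics), which may differ in tie handling from the algorithm in Figure 3.2.

```python
def pareto_frontier(points):
    """Return the non-dominated (error, area, expr) triples.

    Sorting by (error, area) costs O(n log n); one pass then keeps a
    candidate only if its area is strictly smaller than any kept so far.
    """
    frontier = []
    best_area = float('inf')
    for error, area, expr in sorted(points, key=lambda p: (p[0], p[1])):
        if area < best_area:
            frontier.append((error, area, expr))
            best_area = area
    return frontier

# Hypothetical error and area values, for illustration only.
candidates = [(1.0, 900, 'e1'), (2.0, 800, 'e2'), (2.5, 850, 'e3'),
              (3.0, 700, 'e4'), (1.0, 950, 'e5')]
print([expr for _, _, expr in pareto_frontier(candidates)])  # ['e1', 'e2', 'e4']
```

After sorting by error, any candidate whose area is not smaller than the best area seen so far is dominated by an earlier, more accurate candidate, so a single pass suffices.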
3.3.4 Equivalent Expressions Semantics
function FRONTIER(ϵ)
    frontier ← ϵ
    for e ∈ ϵ do
        for e′ ∈ ϵ do
            if Area(e′) < Area(e) ∧ AbsError(e′) < AbsError(e) then
                frontier ← frontier/{e}
            end if
        end for
    end for
    return frontier
end function
Figure 3.2. The algorithm used to compute fr(ϵ), i.e. the Pareto frontier from a set of equivalent
expressions ϵ.
Similar to the analysis of accuracy and resource usage, a set of equivalent expressions
can be computed with semantics. That is, we define structures, i.e. sets of equivalent
expressions, that can be manipulated with arithmetic operators. In our equivalent
expressions semantics, an element of ℘(AExpr≡) is used to assign a set of expressions
to each node in an expression parse tree. To begin with, at each leaf of the tree, the
variable or constant is assigned a set containing itself, i.e. for a leaf x, the set ϵx of equivalent
expressions is ϵx = {x}. After this, we propagate the equivalent expressions in the parse
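tree’s direction of flow, using (3.27) defined below, where fr is the algorithm shown in
Figure 3.2:
$$\epsilon_x \otimes \epsilon_y := \mathrm{fr}\left(\mathrm{cl}\,I_k\left(E_\otimes(\epsilon_x, \epsilon_y)\right)\right), \quad \text{where } E_\otimes(\epsilon_x, \epsilon_y) = \{ e_x \otimes e_y \mid e_x \in \epsilon_x \wedge e_y \in \epsilon_y \} \text{ and } \otimes \in \{+, -, \times, /\}. \tag{3.27}$$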
It is noteworthy that we override the meaning of ⊗, from arithmetic computations
originally, to denote the construction of equivalent expressions. The equation implies that
the propagation procedure recursively constructs a set of equivalent subexpressions
for the parent node from two child expressions, and uses the depth-limited closure
cl Ik to work out a larger set of equivalent expressions. Similarly, we can define
another equation that propagates equivalent subexpressions in an expression with a unary
negation:
$$-\epsilon := \mathrm{fr}\left(\mathrm{cl}\,I_k\left(E_{\mathrm{unary}}(\epsilon)\right)\right), \quad \text{where } E_{\mathrm{unary}}(\epsilon) = \{ -e \mid e \in \epsilon \}. \tag{3.28}$$
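To reduce computation effort, we select only those expressions on the Pareto frontier for
the propagation in the DFG. Although in the worst case the complexity of this process is
exponential, the selection by Pareto optimality accelerates the algorithm significantly. For
example, consider the sample DFG in Figure 3.3. For the subexpression a + b, we have:
$$\begin{aligned}
\epsilon_a + \epsilon_b &= \mathrm{fr}\left(\mathrm{cl}\,I_k\left(E_+(\epsilon_a, \epsilon_b)\right)\right) \\
&= \mathrm{fr}\left(\mathrm{cl}\,I_k\left(E_+(\{a\}, \{b\})\right)\right) \\
&= \mathrm{fr}\left(\mathrm{cl}\,I_k\left(\{a + b\}\right)\right).
\end{aligned} \tag{3.29}$$
Figure 3.3. The DFG for the sample expression (a + b) × (a + b).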
Alternatively, we could view the semantics in terms of DFGs representing the algorithm
for finding equivalent expressions. The parsing of an expression directly determines
the structure of its DFG. For instance, consider the tree structure of the expression
e0 = (a+ b)× (a+ b), as shown in Figure 3.3. This tree structure can be used to generate
a DFG illustrated in Figure 3.4, which when data-flow analysis is applied, discovers a set
of equivalent expressions to e0. The circles labeled 3 and 7 in this diagram are shorthands
for the operations E+ and E× respectively, where E+ and E× are defined in (3.27).
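[Figure 3.4 diagram: nodes numbered 1 to 10. Inputs a (1) and b (2) feed + (3), followed by ∪ (4), Ik (5) and Frontier (6); then × (7), ∪ (8), Ik (9) and Frontier (10).]
Figure 3.4. The DFG for finding equivalent expressions of (a + b) × (a + b).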
For our example in Figure 3.4, similar to the construction of data-flow equations in
Section 2.3 of Chapter 2, we can produce a set of equations from the data-flow of the
DFG, which now produces equivalent expressions:
A(1) = A(1) ∪ {a}, A(2) = A(2) ∪ {b},
A(3) = E+(A(1), A(2)), A(4) = A(3) ∪A(5),
A(5) = Ik(A(4)), A(6) = fr(A(5)),
A(7) = E×(A(6), A(6)), A(8) = A(7) ∪A(9),
A(9) = Ik(A(8)), A(10) = fr(A(9)).
(3.30)
By solving this system of equations for the value A(10), we find a set of expressions that
are equivalent to the original and that produce an optimized trade-off between area and
accuracy. Because of loops in the DFG, it is no longer trivial to find the solution. In
general, the analysis equations are solved iteratively, using the DFA approach discussed
in Section 2.3.1 of Chapter 2. We can regard the set of equations as a single transfer
function F as in (3.31), where the function F takes as input the variables A(1), . . . , A(10)
appearing in the right-hand sides of (3.30) and outputs the values A(1), . . . , A(10)
appearing in the left-hand sides. Our aim is then to find an input $\vec{x}$ to F such that
$F(\vec{x}) = \vec{x}$, i.e. a fixpoint of F.
$$F((A(1), \ldots, A(10))) = (A(1) \cup \{a\}, \ldots, \mathrm{fr}(A(9))). \tag{3.31}$$
Initially we assign A(i) = ∅ for i ∈ {1, 2, . . . , 10}, and we denote $\vec{\varnothing} = (\varnothing, \ldots, \varnothing)$. Then
the least fixpoint of F is:
$$\mathrm{lfp}\,F = \lim_{n \to \infty} F^n(\vec{\varnothing}). \tag{3.32}$$
This expression can be computed iteratively by first evaluating $F(\vec{\varnothing})$, $F^2(\vec{\varnothing}) = F(F(\vec{\varnothing}))$,
and so forth, until the fixpoint is reached for some iteration n, i.e. $F^{n+1}(\vec{\varnothing}) = F(F^n(\vec{\varnothing})) = F^n(\vec{\varnothing})$.
Hence, we know that for any iteration m > n + 1, $F^m(\vec{\varnothing}) = F^n(\vec{\varnothing})$. The value n should
be a finite constant, because the relation can only reach a finite number of expressions.
In cases when lfp F is computationally intensive, we could limit the number of iterations n,
to compute an under-approximation (a subset) of lfp F.
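The iterative solution of the data-flow equations can be sketched in Python as follows. This is a generic fixpoint iteration under stated assumptions: the function iterate_to_fixpoint and the three-equation toy system are invented for this illustration and are much smaller than the system in (3.30).

```python
def iterate_to_fixpoint(transfer, start, max_iterations=None):
    """Apply transfer repeatedly from start until the value stops changing.

    Returns (value, reached); reached is False when max_iterations ran out
    first, in which case value is only a partial result.
    """
    value = start
    count = 0
    while max_iterations is None or count < max_iterations:
        nxt = transfer(value)
        if nxt == value:
            return value, True
        value = nxt
        count += 1
    return value, False

def transfer(state):
    """Toy system: A1 = A1 ∪ {a}, A2 = A2 ∪ {b}, A3 = {x + y | x ∈ A1, y ∈ A2}."""
    a1, a2, a3 = state
    return (a1 | {'a'}, a2 | {'b'}, frozenset(x + y for x in a1 for y in a2))

empty = (frozenset(), frozenset(), frozenset())
solution, reached = iterate_to_fixpoint(transfer, empty)
print(solution[2], reached)
```

Passing a small max_iterations stops early and returns whatever has been discovered so far, which for this toy system is a subset of the full solution.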
The fixpoint solution lfpF gives a set of equivalent expressions derived using our method,
which is found at A(10). In essence, the depth limit acts as a sliding window. The
semantics allow hierarchical transformation of subexpressions using a depth-limited
search and the propagation of a set of subexpressions that are locally Pareto optimal to
the parent expressions in a bottom-up hierarchy.
The problem with the semantics above is that the time complexity of cl Ik scales poorly,
since the worst-case number of subexpressions to explore increases exponentially
with k. An alternative method is therefore to change the structure of
the DFG slightly, as shown in Figure 3.5. The difference is that at each iteration, the
Pareto frontier filters the results to decrease the number of expressions to process in
the next iteration, whereas the former approach filters the Pareto-suboptimal candidates
only at the end of the iterative procedure. The latter method therefore prunes the set
of discovered candidates more frequently than the former. Equivalently, this approach
yields the following semantics for arithmetic operations on equivalent expressions as an
alternative to (3.27):
$$\epsilon_x \otimes \epsilon_y := \mathrm{cl}\,(\mathrm{fr} \circ I_k)\left(E_\otimes(\epsilon_x, \epsilon_y)\right), \quad \text{where } \otimes \in \{+, -, \times, /\}. \tag{3.33}$$
[Figure 3.5 diagram: nodes numbered 1 to 10, labelled a, b, +, ∪, Ik, Frontier, ×, ∪, Ik, Frontier.]
Figure 3.5. The alternative DFG for (a + b) × (a + b).
In the rest of this chapter, we use frontier_trace to indicate our equivalent expression
finding semantics, and greedy_trace to represent the alternative method.
In addition to the above approaches, another possibility is to view the optimization from
the perspective of denotational semantics. We can define a recursive function
O[e]σ♯ which accepts an expression e, and σ♯, an input condition on the variables.
This function produces a set of optimized expressions equivalent to e. The set of input
conditions will be formally defined in Chapter 4. Therefore for e1, e2 ∈ AExpr and a
variable x:
$$\begin{aligned}
O[e_1 \otimes e_2]\sigma^\sharp &= f_{\sigma^\sharp}\left(\left\{ e'_1 \otimes e'_2 \mid e'_1 \in O[e_1]\sigma^\sharp,\ e'_2 \in O[e_2]\sigma^\sharp \right\}\right), \\
O[x]\sigma^\sharp &= \{x\},
\end{aligned} \tag{3.34}$$
where f_σ♯ is a function that transforms and optimizes a set of equivalent expressions
based on the initial condition σ♯, e.g. fr ∘ cl Ik or cl (fr ∘ Ik) used in frontier_trace
or greedy_trace.
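The recursive structure of this definition can be sketched in Python as follows. The sketch is illustrative only: the name optimize, the tuple encoding, and the toy transform are assumptions, and the input condition σ♯ is omitted by treating the transform as already specialized to it.

```python
def optimize(expr, transform):
    """Recursively build a set of candidate expressions for expr.

    transform plays the role of f in (3.34): it maps a set of candidates
    to an optimized set, e.g. a closure followed by a Pareto frontier.
    """
    if not isinstance(expr, tuple):
        return {expr}                       # a leaf yields the set containing itself
    op, left, right = expr
    candidates = {(op, l, r)
                  for l in optimize(left, transform)
                  for r in optimize(right, transform)}
    return transform(candidates)

def commute(candidates):
    """Toy transform: add the commuted form of every + and * candidate."""
    return candidates | {(op, r, l) for (op, l, r) in candidates
                         if op in ('+', '*')}

result = optimize(('*', ('+', 'a', 'b'), 'c'), commute)
print(len(result))  # 4
```

Each operand is optimized first and the parent then combines every pair of operand candidates, which matches the bottom-up propagation described in this section.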
3.3.5 Simultaneous Optimization of Multiple Expressions
To simultaneously optimize multiple expressions, a new operator, the barrier operator
“|”, is introduced to concatenate expressions. Multiple expressions, when concatenated,
allow common subexpressions to be shared. For instance, the expressions a + (b + c) and
a × (b + c) can be concatenated to form a new expression e:
e = a+ (b+ c) | a× (b+ c), (3.35)
and the subexpression b+ c is shared within e, as this behaviour naturally arises from the
resource usage analysis discussed in Section 3.2.
The barrier operator has no rules of equivalence, which prevents any transformation across
the expression boundaries. For a single expression, the quality of accuracy is determined by
its round-off error. However, when evaluating multiple expressions there are choices to
make in terms of determining the overall accuracy of the system of multiple expressions.
The reasons are two-fold [Mar09]. Firstly, perhaps only a user-defined subset of the
expressions that compute the final results are influential. Secondly, we may wish to
minimize either the L1-, L2-, L∞-norm, or a geometric mean of the errors, depending on
which one is more relevant. For this reason, SOAP provides the choice to minimize any
of these above norms, and is designed to minimize the L∞-norm of all expressions by
default, by computing the least upper bound of the following accuracy semantics of two
expressions joined by the barrier operator:
$$(x^\sharp_1, \mu^\sharp_1) \mid (x^\sharp_2, \mu^\sharp_2) := (x^\sharp_1, \mu^\sharp_1) \sqcup (x^\sharp_2, \mu^\sharp_2), \tag{3.36}$$
where the ⊔ operator on the abstract semantics is introduced in Section 2.3.4. This
approach computes the worst-case bound on errors encountered in the evaluation of all
individual expressions in the system of multiple expressions.
With the extended accuracy analysis in place, the automatic multi-objective optimization
techniques for a single expression can be adapted to multiple expressions, by concatenating
them into a single expression using the barrier operator.
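The effect of joining error bounds across the barrier operator can be sketched in Python as follows. This sketch assumes error bounds are plain (lo, hi) intervals and joins only the error component; the function names are chosen for illustration, and the value ranges are left out for brevity.

```python
def join(bound1, bound2):
    """Least upper bound of two intervals: the smallest interval containing both."""
    return (min(bound1[0], bound2[0]), max(bound1[1], bound2[1]))

def abs_error(bound):
    """Magnitude of an error interval, in the spirit of AbsError in (3.7)."""
    return max(abs(bound[0]), abs(bound[1]))

def barrier_error(error_bounds):
    """Join the error bounds of all expressions separated by the barrier."""
    result = error_bounds[0]
    for bound in error_bounds[1:]:
        result = join(result, bound)
    return result

# Hypothetical error intervals of two expressions, for illustration only.
bounds = [(-1e-6, 2e-6), (-3e-6, 1e-6)]
print(abs_error(barrier_error(bounds)))  # 3e-06
```

The magnitude of the joined interval equals the largest magnitude among the individual intervals, which is why this choice corresponds to minimizing the L∞-norm of the errors.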
3.4 Implementation
The majority of SOAP is implemented in Python. For computing errors in real arithmetic,
we use exact arithmetic based on rational numbers within the GNU Multiple Precision
Arithmetic Library (GMP) [Gra+91]. In cases where exact arithmetic is not possible
because of high computational costs, floating-point arithmetic can be used to efficiently
and safely bound round-off error values. We also use the MPFR library [Fou+07] for
access to floating-point rounding modes and arbitrary precision floating-point computation.
Because of the workload of equivalent expression finding, the underlying algorithm is
optimized in many ways. Firstly, for each iteration, the relation finding function Ik is only
applied to newly discovered expressions in the previous iteration, using the algorithm
in Figure 3.1. The second optimization is to cache results of function calls such as Ik,
Area and Error, since there is a large chance that these results from subexpressions are
reused several times; subexpressions are also maximally shared to eliminate duplication
in memory. Thirdly, the computation of Ik is fully multi-threaded.
The resource statistics of operators are provided using FloPoCo [DP11] and Xilinx Synthesis
Technology (XST) [Xil13]. Initially, for each combination of an operator, an exponent
width between 5 and 16, and a mantissa width ranging from 10 to 113, a total of 2496
distinct implementations are generated using FloPoCo. All of them are optimized to
use DSP blocks. They are then synthesized using XST, targeting a Virtex-6 FPGA device
(XC6VLX760). Because LUTs are generally more constrained resources than DSP blocks in
floating-point computations, only synthesis statistics in LUTs are provided. Finally, an RTL
code generation backend can produce synthesizable code from an optimized candidate
expression.
3.5 Results
Because Martel’s approach defers selecting optimal options until the end of equivalent
expression discovery, we developed a method that could produce exactly the same
set of equivalent expressions from the traces computed by Martel, and has the same
time complexity. The difference is that we adapted it to generate a Pareto frontier
from the discovered expressions, instead of only error bounds. This allows us to compare
martel_trace, i.e. our implementation of Martel’s method, against our methods
frontier_trace and greedy_trace discussed in Section 3.3. Figure 3.6 shows the
optimization of the expression (a + b)² using the three methods above, all using depth
limit 3, with input ranges a ∈ [5, 10] and b ∈ [0, 0.001] [Mar07]. The IEEE 754 single-precision
floating-point format with rounding to nearest was used for the evaluation of accuracy
and area estimation. The scatter points represent different implementations of the original
expression that have been explored and analyzed, and the (overlapping) lines denote
the Pareto frontiers. In this example, our methods produce the same Pareto frontier
that Martel’s method could discover, while having up to 50% shorter run time. Because
we consider an accuracy/area trade-off, we find that we can not only have the most
accurate implementation discovered by Martel, but also an option that is only 0.0005%
less accurate, but uses 7% fewer LUTs.
We go beyond the optimization of a small expression, by generating results in Figure 3.7
to show that the same technique is applicable to simultaneous optimization of multiple
large expressions. The expressions e1 and e2, with input ranges a ∈ [1, 2], b ∈ [10, 20],
c ∈ [10, 200], are used as our example:
e1 =(a + a + b)× (a + b + b)× (b + b + c)×
(b + c + c)× (c + c + a)× (c + a + a),
e2 =(1 + b + c)× (a + 1 + c)× (a + b + 1).
(3.37)
We generated and optimized RTL implementations of e1 and e2 simultaneously using
frontier_trace and greedy_trace with the depth limits indicated by the numbers
in the legend of Figure 3.7. Note that because the expressions evaluate to large values, the
errors are also relatively large. We set the depth limit to 2 and found that greedy_trace
executes up to 10× faster than frontier_trace, while discovering a sizable subset
of the Pareto frontier of frontier_trace. Our methods are also significantly faster
and more scalable than martel_trace: because of its poor scalability discussed earlier,
our computer ran out of 8 GB of memory before martel_trace could produce any results. If
we normalize the time allowed for each method and compare the performance, we
find that greedy_trace with a depth limit 3 takes slightly less time than
frontier_trace with a depth limit 2, but produces a generally better Pareto frontier.
The alternative implementations of the original expression provided by the Pareto frontier
of greedy_trace can either reduce the LUTs used by approximately 10% when accuracy
is not crucial, or can be about 10% more accurate if resource usage is not a concern. It also
makes it possible to choose different trade-off options, such as an implementation that
is 7% more accurate and uses 7% fewer LUTs than the original expression.
Furthermore, Figure 3.8 varies the mantissa width of the floating-point format, and
presents the Pareto frontier of both e1 and e2 together under optimization. Floating-
point formats with mantissa widths ranging from 10 to 112 bits were used to optimize
and evaluate the expressions for both accuracy and area usage. It turns out that some
implementations originally on the Pareto frontier of Figure 3.7 are no longer desirable, as
by varying the mantissa width, new implementations are both more accurate and less
resource demanding.
Besides the large example expressions above, Figure 3.9 and Figure 3.10 are produced by
optimizing expressions with real applications under single precision. Figure 3.9 shows the
optimization of the Taylor expansion of sin(x + y), where x ∈ [−0.1, 0.1] and y ∈ [0, 1],
using greedy_trace with a depth limit 3. The function taylor(f, d) indicates the Taylor
expansion of function f(x, y) at x = y = 0 with a maximum degree of d. For order 5 we
reduced error by more than 60%. Figure 3.10 illustrates the results obtained using the
depth limit 3 with the Motzkin polynomial [Dem11] x⁶ + y⁴z² + y²z⁴ − 3x²y²z², which
is known to be difficult to evaluate accurately, especially using inputs x ∈ [−0.99, 1],
y ∈ [1, 1.01], z ∈ [−0.01, 0.01].
All these above results are generated with the same type of floating-point operators in
each expression. Although in this chapter we do not analyze the number of DSPs used in
synthesized circuits, the DSP count increases linearly with the estimated LUT count. In
the next chapter we further introduce the estimation of DSP elements used as another
objective to optimize.
Because of the scalability problem of the depth limit k mentioned in Section 3.3, k ≤ 3 for
all of our experiments. With k = 4, the tool does not terminate in a reasonable amount
of time and saturates the memory (16 GB) of our system. In the following chapters,
we propose methods to limit the number of iterations and the number of equivalent
expressions discovered to mitigate the lack of scalability of k.
Finally, Figure 3.11 demonstrates the accuracy of the area estimation used in our analysis.
It compares the actual LUTs necessary with the estimated number of LUTs using our
semantics, by synthesizing more than 6000 equivalent expressions derived from a+b+c,
(a + 1) × (b + 1) × (c + 1), e1, and e2 using varying mantissa widths. The dotted line
indicates exact area estimation; a scatter point that is close to the line means the area
estimation for that particular implementation is accurate. The solid black line represents
the linear regression line of all scatter points. On average, our area estimation is a 6.1%
over-approximation of the actual number of LUTs, and the worst case over-approximation
is 7.7%.
[Figure 3.6 plot. Legend: greedy_trace, 3 (0.04s); frontier_trace, 3 (0.08s); martel_trace, 3 (0.09s). Annotated implementations: ((a + b) * (a + b)); (((a + b) * a) + ((a + b) * b)); ((((a + b) + a) * b) + (a * a)); ((((a * b) + (b * b)) + (a * b)) + (a * a)). Horizontal axis ticks range from 600 to 1600.]
Figure 3.6. Optimization of (a + b)².
[Figure 3.7 plot. Legend: frontier_trace, 2 (2.97s); greedy_trace, 3 (2.39s); greedy_trace, 2 (0.27s); martel_trace, 2 (out of memory); original. Horizontal axis ticks range from 8000 to 12000; vertical axis scale ×10⁶.]
Figure 3.7. Simultaneous optimization of both e1 and e2.
Figure 3.8. Varying the mantissa width of Figure 3.7.
3.6 Summary
This chapter provides a formal approach to the optimization of arithmetic expressions
for both accuracy and resource usage in high-level synthesis. The method proposed
in this chapter and the associated tool, SOAP, encompass three kinds of semantics that
describe the accumulated roundoff errors, count operators in expressions considering
common subexpression elimination, and derive equivalent expressions. For a set of
input expressions, the proposed approach works out the respective sets of equivalent
expressions in a hierarchical bottom-up fashion, with a windowing depth limit and Pareto
selection to help reduce the complexity of equivalent expression discovery. Using SOAP,
we improve either the accuracy of our sample expressions or the resource utilization by
up to 60% over the originals under single precision. SOAP enables a high-level synthesis
tool to optimize the structure as well as the precision of arithmetic expressions, then
to automatically choose an implementation that satisfies accuracy and resource usage
constraints.
Because our approach is underpinned by formal semantics, it provides the necessary
foundation to extend the method to general numerical program transformation
in high-level synthesis. Therefore in Chapter 4, we build on the methodologies
developed in this chapter, and propose a structural approach to program optimization by
safely rewriting equivalent structures in numerical programs.
Figure 3.9. The Taylor expansion of sin(x + y). [Plot omitted. Legend: taylor(sin(x + y),2) (0.05s) and original; taylor(sin(x + y),3) (0.28s) and original; taylor(sin(x + y),4) (1.14s) and original; taylor(sin(x + y),5) (7.75s) and original.]
Figure 3.10. The Motzkin polynomial em. [Plot omitted. Legend: martel trace (out of memory); original.]
Chapter 4 Numerical Program Optimization
The previous chapter introduced a new methodology to efficiently restructure arithmetic
expressions for an optimized trade-off between two performance metrics, i.e. numerical
accuracy when evaluated and area usage in synthesized FPGA implementations. This
method, however, has a substantial limitation when applied to general numerical programs:
it can only be applied to straight-line code without control structures such as
branches and loops.
In this chapter, a new general program optimization technique for numerical algorithms is
therefore proposed, which analyzes and optimizes if statements as well as while loops
in a numerical program. This enables the joint optimization of accuracy and resource
usage, as well as the trade-off between both performance metrics. A tool is thus developed
to perform source-to-source optimization of numerical programs targeting FPGAs, and
generate implementations that trade off resource usage and numerical accuracy.
Similar to the approach proposed in Chapter 3, equivalence rules such as associativity
(a + b) + c ≡ a + (b + c), and distributivity (a + b) × c ≡ a × c + b × c are exploited
to automatically optimize implementations for the optimal trade-off between resource
usage, i.e. the number of LUTs and DSP elements utilized, and accuracy when evaluated
using floating-point computations. This process generates a Pareto frontier of
optimized numerical programs. For example, with the single-precision floating-point format,
the tool finds that, given inputs x ∈ [0, 100] and y ∈ [0, 2], the program:
if (x < 1) {
x = (x + y) + 0.1f;
} else {
x = x + (y + 0.1f);
}
is most accurate when the subexpression “(x+y)+0.1f” is rewritten as “(x+0.1f)+y”;
on the other hand, the original program uses fewest resources when subexpressions are
shared and the if statement is eliminated:
x = x + (y + 0.1f);
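That the three orderings of the same real sum can round differently is easy to check on one concrete input. The following is a minimal Python sketch and not part of SOAP: it emulates single-precision addition by rounding each binary64 sum to binary32 with the standard struct module. The helper names (f32, add32, left_assoc, swapped, right_assoc) and the hand-picked input x = 5 × 2^-26, y = 1 are illustrative assumptions. A single input says nothing about the worst-case bounds that the analysis computes; it only shows that the orderings are not interchangeable in floating point.

```python
import struct


def f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def add32(p, q):
    """Emulated single-precision addition of two binary32 operands.

    The binary64 sum is rounded to binary32; for one addition of binary32
    operands this matches a correctly rounded binary32 addition.
    """
    return f32(p + q)


C = f32(0.1)  # the constant 0.1f


def left_assoc(x, y):
    return add32(add32(x, y), C)   # (x + y) + 0.1f


def swapped(x, y):
    return add32(add32(x, C), y)   # (x + 0.1f) + y


def right_assoc(x, y):
    return add32(x, add32(y, C))   # x + (y + 0.1f)


# Hand-picked input (an assumption for illustration): x < 1, y in [0, 2].
x, y = 5 * 2.0 ** -26, 1.0
reference = (x + y) + C  # exact in binary64 for these operands

for name, fn in [("(x+y)+0.1f", left_assoc),
                 ("(x+0.1f)+y", swapped),
                 ("x+(y+0.1f)", right_assoc)]:
    result = fn(x, y)
    print(name, result, "error:", result - reference)
```

For this particular input the ordering (x+0.1f)+y lands closer to the reference sum than (x+y)+0.1f, because adding the tiny x to y first forces an early rounding step.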
The structural optimization of general numerical programs is much more complex than
that of arithmetic expressions, for three reasons. First, during program execution,
variables are often updated with new values. Our optimization should therefore
perform static analysis on the values of variables, and use the result to optimize
specifically for the trade-off between accuracy and resource usage. Second, the combinatorial
explosion of expression equivalence is further exacerbated by the expressiveness of a
general numerical program. Finally, it is much more difficult to formally define program
equivalence, and subsequently, to search efficiently for optimized equivalent programs.
The difficulty in defining program equivalence is due to the fact that two programs can be
identical in function, but have distinct syntactic structure because of the expressiveness
of a HLL. In fact, one can easily imagine infinitely many ways to rewrite numerical
programs, and often these equivalent programs have identical resource usage, latency
and accuracy characteristics. In practice, it is desirable to eliminate as much as possible
the need for these syntactic rewrites that do not affect our performance metrics, so that
the search space of equivalent programs is greatly reduced.
We explored in Section 2.4 of Chapter 2 various intermediate representations (IRs) designed
for program transformation, and specifically examined the program expression graph
(PEG) in closer detail, because it fits our requirement to abstract away as much irrelevant
syntactic information as possible. However, the PEG is not suited to the optimization of
numerical accuracy because it uses cycles in graphs. Evaluating the numerical accuracy
of an IR with a cyclic structure requires analyzing a large proportion of the IR, whereas
a tree structure can be reasoned about compositionally in a bottom-up hierarchy. By introducing
equivalence edges in the graph, the number of elementary cycles in the graph
could increase exponentially in the number of equivalences that have been discovered,
which further exacerbates the already high computational demand. This has significant
implications for program transformation. First, for the above reason, the PEG-based approach
does not use static analysis to optimize for numerical accuracy, while we wish to reason
about accuracy and utilize it to steer optimization. Moreover, the tree structure
allows us to easily support partial loop unrolling by simply extending the equivalence
relations while avoiding re-evaluation. In contrast, like [Mar09] and [Dam+15], equality
saturation is unable to perform partial loop unrolling. In this chapter, a solution to the
above limitations of PEGs is therefore proposed, by introducing a new tree-based IR with
fixpoint constructs to specifically tackle the problem of program equivalences.
Additionally, as none of the methods in Section 2.5 of Chapter 2 looks at the multiple-
objective optimization of numerical programs, this chapter is the first to propose a tool
that performs a semantics-directed program transformation, which optimizes not only
arithmetic expressions, but also numerical programs. The tool optimizes for the trade-off
between numerical accuracy and resource usage when synthesized to FPGAs.
The optimization flow is designed to be flexible, and its implications are three-fold.
Firstly, arithmetic computations can be optimized across assignments, if statements and
while loops. Secondly, we automatically explore the numerical implications of partial
loop unrolling and loop splitting, which can create more opportunities for minimizing
round-off errors, and hence further increase the range of options in the Pareto frontier of
trade-offs. Finally, our method naturally subsumes constant propagation, redundant code
elimination, and also branch and loop fusions.
The main contributions in this chapter are as follows:
1. Metasemantic intermediate representation (MIR), a new IR of the behaviour of
numerical programs. Its structure is designed to be manipulated and analyzed with
ease.
2. Semantics-based analyses that reason about not only the resource utilization (num-
ber of LUTs and DSP elements), and safe ranges of values and errors for programs,
but also potential errors such as overflows and non-termination. This provides the
cost functions that we wish to minimize by rewriting programs.
3. A new framework of numerical program transformations is developed based on
the methods in Section 3.3 of Chapter 3. It enables the back and forth translation
between the program and MIR, which preserves the semantics of the original
program in real arithmetic.
4. The updated tool, SOAP, which trades off resource usage and accuracy by providing
a safe, semantics-directed and flexible optimization targeting numerical programs
for high-level synthesis. Experimental results are presented in Section 4.7.
This chapter is organized as follows. We start by defining our program syntax in Section 4.1.
Using the syntax definition, this chapter then gives a detailed formal explanation
of the numerical program transformation, which consists of three stages. Section 4.2
describes how numerical programs can be translated into MIRs. Sections 4.3 and 4.4
respectively discuss how we infer bounds and error bounds on variables and analyze
resource usage estimates. Section 4.5 explains how these analyses can be used to efficiently
discover equivalent structures in the analyzed MIR. Section 4.6 explains how a
chosen MIR can be translated into an optimized numerical program. Then we present
the optimization results in Section 4.7 and finally Section 4.8 concludes this chapter.
4.1 Syntax Definition
Before we discuss program transformation, we first look at the syntax definition used to write
numerical programs. Our program transformation optimizes programs written in a subset
of C99. In this section, we formally introduce the syntax that constitutes a subset of C that
supports arithmetic and Boolean expressions, conditional branches, as well as while and
for loops. Our language allows numerical data types int and float, respectively
standing for integer and floating-point types.
We define AExpr and BExpr as the sets of arithmetic and Boolean expressions respectively,
and Stmt as the set of program statements. We then have the following simplified
syntax definition for expressions and numerical programs, written in the Backus-Naur
form [Knu64]:
\[
\begin{aligned}
a &::= n \mid x \mid \texttt{-}a_1 \mid a_1 \otimes a_2, \qquad
b ::= \texttt{!}b_1 \mid a_1 \oplus a_2 \mid b_1 \odot b_2, \\
s &::= [t]\; x\; [\texttt{=}\; a]\,\texttt{;} \mid s_1\; s_2 \mid \texttt{if (}b\texttt{) \{}s_1\texttt{\}}\; [\texttt{else \{}s_2\texttt{\}}] \mid \texttt{while (}b\texttt{) \{}s\texttt{\}}, \\
x &::= v, \qquad v ::= \texttt{x} \mid \texttt{y} \mid \texttt{z} \mid \ldots, \qquad t ::= \texttt{int} \mid \texttt{float}.
\end{aligned}
\tag{4.1}
\]
We define ⊗ ∈ {+,-,*,/}, ⊕ ∈ {<,<=,>,>=,==,!=}, and ⊙ ∈ {||,&&} respectively
to be the arithmetic, comparison and Boolean operators, and n is a numerical constant of
type either int or float. The v terms x,y,z ∈ Var are variables. In the next chapter,
the definition of x is further extended to arrays. Additionally, a, a1, a2 ∈ AExpr are
arithmetic expressions, and b, b1, b2 range over Boolean expressions, BExpr. Program
statements s, s1, s2 ∈ Stmt comprise assignment statements, sequential statements,
if branches and while loops. Although the for loop is not explicitly defined in the above
syntax definition, it can be trivially expressed using a while loop. Finally, t is an optional
keyword which can be either int or float to specify the type of x if x is not previously
declared. Note that a term of the form [d] indicates d is optional.
Furthermore, we introduce the “#pragma soap begin” and “#pragma soap end”
directives to delimit the code fragment to be optimized. We can also use “#pragma
soap in” and “#pragma soap out” to provide input ranges and to declare output
variables, respectively.
As a simple example, the program in Figure 4.1 computes an approximate value of π²a/6.
It has two inputs: a, a floating-point value between 0 and 1, and n, an integer value
between 10 and 20, which determines the number of iterations of the loop. Its return
variable is y.
#pragma soap begin
#pragma soap in \
    float a = [0.0, 1.0], int n = [10, 20]
#pragma soap out float y
x = 0;
y = 0.0f;
while (x < n) {
    x = x + 1;
    y = y + a / (x * x);
}
#pragma soap end
Figure 4.1. A simple program, basel, written with our syntax definition.
Despite the simplicity of our syntax, it includes all the features of a full programming
language rather than the arithmetic expression language used in Chapter 3. We will add
support for arrays and matrices in Chapter 5, and show that this can be added with few
changes to our method.
4.2 Metasemantic Intermediate Representation
There are infinitely many ways to rewrite numerical C programs, and many of these
rewrites produce programs that have the same resource usage, accuracy and latency
characteristics. For instance, consider the following two pairs of programs, where the
programs in each pair are equivalent but syntactically different, as they carry out the same
(and potentially redundant) computations.
x = x + 1;
y = 2 * x;
x = x + 3;
(a) P1
y = x + 1;
x = y;
y = y * 2;
x = x + 3;
(b) P′1
x = x + 1;
if (b)
Figure 4.2. Two pairs of programs that are equivalent but syntactically different.
In practice, it is desirable to eliminate as much as possible the need for these syntactic
rewrites that do not affect our performance metrics, e.g. the numerical accuracy and
resource usage of synthesized circuits. It is therefore desirable to perform transformations
on a DAG representation of the program, rather than on the program text directly. This
section introduces a new DAG-based IR, which we call the metasemantic intermediate
representation (MIR). It expresses how each program variable is updated while preserving
the control- and data-flow of the original program, but abstracts away the order in which
the updates occur, and ignores any temporary variables that are not marked as program
outputs.
As an example, the two equivalent programs P1 and P ′1 in Figure 4.2 can be automatically







MIRs also abstract the control structure (i.e. if branches and while loops) of a program,
preserving only the computations that lead to the outputs. For instance, by using the
ternary conditional operator “?” from C, programs with conditionals such as P2 and P ′2 in







This representation is useful to us, because a single MIR is able to capture a class of
syntactically-distinct programs, all of which have the same resource usage, accuracy,
and latency characteristics. By searching for transformations on MIRs, we drastically
reduce the size of our search space. Note that expressions in the MIR can share common
structures; this is useful for modeling the sharing of common subexpressions and makes
the search for optimizations much more efficient.
The first step of our approach is to analyze the program return value into a MIR. This
procedure is called metasemantic analysis (MA). The MA abstracts away irrelevant in-
formation, and preserves the essence of program execution. Details such as temporary
variables and the ordering of program statements are discarded, whereas the abstraction
still retains data-flow dependencies and keeps only computations that contribute to the
final results.
We work with the MIR as an abstraction of the program because the discovery of equivalent
structures can be much simplified. For instance, the program “x = 1; y = 2;” is
the same as “y = 2; x = 1;” because reordering non-dependent statements does
not change program semantics. If we were to base our transformations on the program
syntax, we would need to enable this kind of equivalence relation even though it has zero
impact on our optimization with respect to resource usage and accuracy. A simpler IR
means that we can explore a much smaller search space.
Our method analyzes a program by recursively dividing the program into smaller parts,
where each part can be separately analyzed into a MIR and composed together to form
a single MIR. A MIR is a mathematical object that associates each program variable
with a semantic expression. A semantic expression is an arithmetic expression, but with
additional syntactic features to support if statements and while loops. We represent
semantic expressions with DAGs that share common structures and define SemExpr
as the set of semantic expressions, and MIR as the set of MIRs. Because a MIR pairs a
variable with an expression, we can view it as a function Var → SemExpr that maps a
variable into a semantic expression. For instance, µ(x) returns the associated expression
of the variable x ∈ Var for the MIR µ ∈ MIR. For each variable, its semantic expression
in itself provides a complete picture of how computations can lead to the resulting value
of the variable. In the rest of this section, we progressively explain how each type of
program statement defined in (4.1) is analyzed into a MIR.
Similar to arithmetic expressions, which can be written in a linear form (e.g. a + b), or in
a tree structure (e.g. a + node with the children a and b), MIRs can also be expressed in both.
The former is more concise, whereas the latter explicitly shares common structures.
4.2.1 Assignment statements
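An assignment statement is in the form of “x = e;”, where x ∈ Var is a program variable
and e ∈ AExpr is an arithmetic expression. The metasemantic analysis of it produces a
MIR as follows:
\[
\left[\, y \mapsto \begin{cases} e & \text{if } y = x, \\ y & \text{otherwise} \end{cases} \,\right]_{y \in \mathrm{Var}}
\tag{4.4}
\]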
The MIR in (4.4) signifies that, for a variable y ∈ Var, if y is x, then the expression
e is associated with the variable x. In graph form, e is a semantic expression represented with a DAG.
The DAG shares all common subexpressions in e. For instance, an expression written
as (x + 1) × (x + 1) shares the subexpression node x + 1 by reusing the node in the
DAG. Every other program variable y ∈ Var, where y ≠ x, is associated with the
semantic expression y ∈ SemExpr, representing that the MIR does not alter the value
of any program variable except x, because only x is updated in the statement.
For example, we consider a program with two variables x and y. Analyzing the statement
“y = x * 2;” produces the following MIR. It is notable that, in graph form, the variable x is
shared between the two semantic expressions:
\[
[\, x \mapsto x,\; y \mapsto x \times 2 \,]
\]
4.2.2 Sequential statements
A sequential statement, “s1 s2”, is formed by joining together s1 and s2, where s1, s2 ∈
Stmt are statements. It signifies that s1 and s2 are executed in sequence. Therefore, it
is necessary to append the effect of executing s2 to that of s1, to arrive at the full MIR
of “s1 s2”. This concept can be realized by defining a new operator ?, the composition
operator, such that the MIR of “s1 s2” is equal to µ2 ? µ1, where µ1 and µ2 are the MIRs
of s1 and s2 respectively. The resulting MIR of µ2 ? µ1 is constructed by substituting,
for every expression e ∈ SemExpr in µ2, each variable x in e with µ1(x), which is the
associated expression of x in µ1. It is noteworthy that µ1 is always computed before µ2
when evaluated, as there is a data dependence from µ2 to µ1. Furthermore, the operator
allows the format e ? µ, where e ∈ SemExpr is called the target expression and µ ∈ MIR
is the source MIR, to mean that the variables in e are substituted with their expressions in µ
using the composition strategy above.
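The substitution behind the composition operator can be sketched in a few lines of Python. This is an illustrative model only, not SOAP's implementation: expressions are nested tuples, variables are strings, a MIR is a dict from variable names to expressions, and the helper names (subst, assign, compose, analyze) are assumptions introduced here. Sharing of common subexpressions in a DAG is not modelled.

```python
def subst(expr, mir):
    """Replace every variable in expr with its expression in mir."""
    if isinstance(expr, str):              # a variable
        return mir.get(expr, expr)
    if isinstance(expr, tuple):            # (operator, operand, ...)
        op, *args = expr
        return (op, *(subst(arg, mir) for arg in args))
    return expr                            # a numeric constant


def assign(var, expr, variables):
    """MIR of `var = expr;`: identity on every variable except var."""
    mir = {v: v for v in variables}
    mir[var] = expr
    return mir


def compose(mu2, mu1):
    """MIR of `s1 s2`: substitute mu1 into every expression of mu2."""
    return {var: subst(expr, mu1) for var, expr in mu2.items()}


def analyze(statements, variables):
    """Fold a list of (variable, expression) assignments into one MIR."""
    mir = {v: v for v in variables}
    for var, expr in statements:
        mir = compose(assign(var, expr, variables), mir)
    return mir


variables = ["x", "y"]
# x = x + 1; y = x * 2;
example = analyze([("x", ("+", "x", 1)), ("y", ("*", "x", 2))], variables)
# Reordering independent statements yields the same MIR.
order_a = analyze([("x", 1), ("y", 2)], variables)
order_b = analyze([("y", 2), ("x", 1)], variables)
print(example)
print(order_a == order_b)
```

Because the result is a mapping rather than a statement list, the two orderings of independent assignments collapse to one object, which is the reduction in search space the text describes.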
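We illustrate this by finding the MIR of a simple example program, “x=x+1; y=x*2;”.
Using the MIR of assignments, the MIRs of the respective assignment statements can be
derived, as shown in (4.6).
\[
\mu_1 = [\, x \mapsto x + 1,\; y \mapsto y \,], \qquad
\mu_2 = [\, x \mapsto x,\; y \mapsto x \times 2 \,].
\tag{4.6}
\]
By substituting the variables in µ2 with the corresponding expressions in µ1, we arrive at the MIR:
\[
\mu_2 \mathbin{?} \mu_1 = [\, x \mapsto x + 1,\; y \mapsto (x + 1) \times 2 \,].
\]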
4.2.3 Conditional branches
Conditional branches, or if statements, are represented with “if (b) {s1} else {s2}”.
Here b ∈ BExpr is a Boolean expression, and s1, s2 ∈ Stmt are respectively the true-
and false-branches. Our analysis of if statements is slightly more complex, as we start to
consider control flow. The analysis is carried out in two steps. The first step is to compute
recursively the MIRs µ1, µ2 ∈ MIR of the respective true- and false-branches, namely, s1
and s2. We introduce the conditional node “?”, which is derived from C syntax, to signify
conditional branches in expressions. The left-most, middle and right-most children of this
node are respectively the Boolean expression, the true- and false-expressions. Then the
second step is to compute a new MIR, where each program variable x ∈ Var is associated
with a conditional node with three children, the Boolean expression b, µ1(x) and µ2(x).
The final MIR is therefore:
\[
[\, x \mapsto b \mathbin{?} \mu_1(x) : \mu_2(x) \,]_{x \in \mathrm{Var}}.
\tag{4.8}
\]
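The construction in (4.8) can be sketched as follows. This is a minimal Python model under stated assumptions, not SOAP's code: MIRs are dicts from variable names to nested-tuple expressions, the function name if_mir is hypothetical, and collapsing a conditional node whose two branches are identical is our own simplification for variables that neither branch modifies.

```python
def if_mir(cond, mu_true, mu_false):
    """MIR of `if (cond) {s1} else {s2}` from the MIRs of s1 and s2."""
    mir = {}
    for var in mu_true:
        true_expr, false_expr = mu_true[var], mu_false[var]
        if true_expr == false_expr:
            # Both branches agree, so no conditional node is needed.
            mir[var] = true_expr
        else:
            mir[var] = ("?", cond, true_expr, false_expr)
    return mir


variables = ["x", "y"]
identity = {v: v for v in variables}

# if (x < 0) y = x * 2;   (a missing else branch changes nothing)
mu_true = dict(identity, y=("*", "x", 2))
mu_false = identity
result = if_mir(("<", "x", 0), mu_true, mu_false)
print(result)
```

Treating the conditional as an ordinary operator node means the same rewriting machinery used for arithmetic can later be applied to it.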
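As an example, we consider the program “if (x < 0) y = x * 2;”, where the set
of program variables is {x, y}. The true-branch has the MIR [x ↦ x, y ↦ x × 2], and the
absent false-branch leaves both variables unchanged, i.e. [x ↦ x, y ↦ y]. Applying (4.8) gives:
\[
[\, x \mapsto (x < 0) \mathbin{?} x : x,\; y \mapsto (x < 0) \mathbin{?} x \times 2 : y \,]
\]
Because both true- and false-expressions of x are the same, regardless of the truth value
of x < 0, the two expressions evaluate to the same value. In our analysis we can
therefore simplify the MIR to:
\[
[\, x \mapsto x,\; y \mapsto (x < 0) \mathbin{?} x \times 2 : y \,]
\]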
The traditional approach to program abstraction uses CDFGs [Nam+04]. CDFGs preserve
the ordering of sequential statements, use a one-to-one mapping from assignment
statements to assignment nodes, use storage nodes to store the results of assignments,
i.e. they allow nodes to act as memory to store values, and finally, contain cycles in
graphs to represent program loops. In contrast, our MIRs, from our analysis point of
view, use no local storage of temporary values, and discard unnecessary intermediate
statements. Most importantly, we treat control structures as operators in expressions,
in the same way as arithmetic computations, and we will show later in Section 4.5 that this
enables greater flexibility in program transformation based on MIRs. In comparison with
CDFGs, these facts make MIRs a more suitable candidate for exploring the search
space of program transformations.
4.2.4 Loops
A possible way to represent a while loop, “while (b) {s}”, where b ∈ BExpr and
s ∈ Stmt, is to effectively analyze it as a fully unrolled loop. Unfortunately the resulting
semantic expressions have an infinite depth, which cannot be represented fully in a data
structure. Because these expressions have recurring patterns, we introduce a
new operator for semantic expressions, “fix”, which we call the fixpoint operator, to capture
this pattern. For each variable x updated in the loop, we use the fixpoint expression
fix(f) to represent the while loop, where the function f : SemExpr → SemExpr,
defined as f(ex) = b ? (ex ? µs) : x, represents the computation of one iteration of the loop,
ex ? µs is a composition, and µs is the MIR of the loop body. Finally, the fixpoint expression
of a while loop can be written more succinctly using a lambda expression [Bar13],
i.e. fix (λex . b ? (ex ? µs) : x).
In graph form, for simplicity the fixpoint node admits three child nodes, namely, the
Boolean expression b, the loop body represented by a MIR, and the loop exit variable.
The loop body MIR can be obtained with our MA of the loop body, and the loop exit
variable denotes which variable we use on loop exit as the evaluated result of the fixpoint
expression. We let µs be the MIR of the loop body s, and derive the MIR of the










Here, var (µs) computes the set of variables that are assigned in the loop body µs. If a




otherwise x is not updated in the loop, and the loop has no effect on its value, therefore
it is paired with an expression x. Finally, the constructed MIR maximally shares common
expressions across nested MIRs.
As a simple example, consider the program in Figure 4.3a. The MIR of this example
program allows a nested MIR to reference and reuse an expression from the outer MIR as
demonstrated in Figure 4.3b, where:
\[
\mu = [\, x \mapsto x + 1 \,], \quad \text{and} \quad b = x < y.
\tag{4.12}
\]
y = y + 1;
while (x < y) {










Figure 4.3. A simple program which exhibits common subexpressions reuse across nested MIRs.
4.2.5 Example analysis
To illustrate all of the above translation processes in conjunction, we can perform the MA
on the program basel in Figure 4.1. Here, we translate it into the following MIR, and
the semantic expression for x is omitted for simplicity:


















4.3 Accuracy Analysis
Because we make use of accuracy analysis to navigate the Pareto optimization of program
candidates, we start by providing an overview of how the accuracy of simple arithmetic
expressions is analyzed with the SOAP framework. Since that framework only allows arithmetic
expressions with simple operators {+,−,×}, we then explain how it can be extended to fully
analyze MIRs and semantic expressions.
In a typical program execution, values of variables, typically integers Z and floating-point
values F, are modified according to the effect of the program statements, and they are
propagated through arithmetic operators from the beginning to the end of the program.
In Section 3.1 of Chapter 3, alternative semantics are proposed to instead propagate
ranges of values together with the associated round-off error bounds, i.e. the value-error
bound (v♯, e♯) ∈ E♯, in order to analyze the accuracy of floating-point numerical programs.
In this section, this technique is further generalized to numerical programs.
Initially, we formalize the analyzed values of a program as an abstract program
state using the domain ΣE♯ = Var → E♯, where a state σ♯ ∈ ΣE♯ maps each variable x to its
associated value-error bound σ♯(x) ∈ E♯. The purpose of this abstract program state is to
provide information on the bounds on values and errors for all variables at any particular
location of program execution. For instance, we assume an abstract state σ♯0 ∈ ΣE♯ which is:
\[
\sigma^\sharp_0 = \left[\, a \mapsto \left([0, 1],\, e^\sharp_0\right),\; b \mapsto \left([1, 2],\, e^\sharp_0\right) \right].
\tag{4.14}
\]
This means that initially a and b are floating-point values bounded by [0, 1] and [1, 2]
respectively, and the error interval e♯0 = [0, 0] denotes the absence of round-off errors.
In Section 3.1 of Chapter 3 we introduced the function Error : AExpr → E♯ to evaluate
the bounds on the result of computing an expression and on its round-off error, but we now
extend it to explicitly make use of the input ranges of variables. Here we introduce a new
function Es : SemExpr → ΣE♯ → E♯ which further accepts initial bounds on the values
and errors of variables. The formula Es [e]σ♯, where e ∈ SemExpr and σ♯ ∈ ΣE♯, is used
to denote the accuracy analysis of the expression e with the input state σ♯.
Then in single-precision, the error analysis of a + b, given the initial bounds σ♯0 in (4.14),
produces the following result:
\[
E_s [a + b]\, \sigma^\sharp_0 = \left( [1, 3],\; \left[ -1.19209304 \times 10^{-7},\; 1.19209304 \times 10^{-7} \right] \right).
\tag{4.15}
\]
This means that the result of this computation is in the range of v♯ = [1, 3], and the
round-off error induced by this computation is bounded by the interval e♯.
The method outlined in Chapter 3 to analyze the accuracy of arithmetic expressions
supports only addition, subtraction, multiplication and division. In this section, we
explain in detail how it is extended to support MIRs, and our additional operators in
semantic expressions, i.e. the composition, ternary conditional and fixpoint operators.
4.3.1 MIR
In the same way that an expression can be analyzed for its accuracy, a MIR, which is
a mapping of variables to semantic expressions, can be analyzed by performing the Es
analysis for each of its expressions. For instance, the accuracy of a MIR:
\[
\mu_0 = [\, a \mapsto a + b,\; b \mapsto a \times 0.5 \,],
\tag{4.16}
\]
with an input state:
\[
\sigma^\sharp_0 = [\, a \mapsto ([0, 1], [0, 0]),\; b \mapsto ([1, 2], [0, 0]) \,],
\tag{4.17}
\]
can be analyzed as follows.
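First we analyze the individual expressions µ0(a) = a + b and µ0(b) = a × 0.5, which gives:
\[
(v^\sharp_a, e^\sharp_a) = E_s [a + b]\, \sigma^\sharp_0 = \left( [1, 3],\; \left[ -1.19209304 \times 10^{-7},\; 1.19209304 \times 10^{-7} \right] \right),
\tag{4.18}
\]
\[
(v^\sharp_b, e^\sharp_b) = E_s [a \times 0.5]\, \sigma^\sharp_0 = \left( [0, 0.5],\; \left[ -2.98023259 \times 10^{-8},\; 2.98023259 \times 10^{-8} \right] \right).
\tag{4.19}
\]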
Then the analyzed results are collected into an abstract state assigning the value-error
bounds to their corresponding variables, that is:
\[
\left[\, a \mapsto (v^\sharp_a, e^\sharp_a),\; b \mapsto (v^\sharp_b, e^\sharp_b) \,\right].
\tag{4.20}
\]
To generalize, we can formally define a function, Em [µ]σ♯, to perform the above analysis,
which takes as inputs the MIR µ ∈ MIR and an abstract input state σ♯ ∈ ΣE♯. It computes
a new state σ♯′, where for each variable x ∈ Var, σ♯′(x) is the analyzed value-error range
of the expression µ(x). The error analysis of a MIR is therefore:
\[
E_m [\mu]\, \sigma^\sharp = \left[\, x \mapsto E_s [\mu(x)]\, \sigma^\sharp \,\right]_{x \in \mathrm{var}(\mu)}.
\tag{4.21}
\]
The notation [x ↦ Es [µ(x)]σ♯]x∈var(µ) means that the mapping is constructed by collecting,
for each variable x ∈ var (µ), the pairing of x with the analyzed value-error bound of
the semantic expression µ(x).
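The per-variable analysis of a MIR can be sketched as below. This is a simplified Python model and not the analysis of Chapter 3: we assume that each operation contributes at most half a binary32 unit in the last place at the largest magnitude of its result range, that constants are exactly representable, and that subnormals and outward rounding of the bounds themselves can be ignored. The names analyze_expr and analyze_mir stand in for Es and Em; all helper names are illustrative.

```python
import math


def half_ulp32(magnitude):
    """Half a binary32 unit in the last place at this magnitude."""
    if magnitude == 0.0:
        return 0.0
    _, exponent = math.frexp(magnitude)  # magnitude = f * 2**exponent, 0.5 <= f < 1
    return 2.0 ** (exponent - 25)


def i_add(p, q):
    return (p[0] + q[0], p[1] + q[1])


def i_mul(p, q):
    products = [a * b for a in p for b in q]
    return (min(products), max(products))


def rounding(value_range):
    h = half_ulp32(max(abs(value_range[0]), abs(value_range[1])))
    return (-h, h)


def analyze_expr(expr, state):
    """Value-error bound of an expression under an abstract state."""
    if isinstance(expr, str):
        return state[expr]
    if isinstance(expr, (int, float)):
        return ((expr, expr), (0.0, 0.0))  # assumed exactly representable
    op, left, right = expr
    (vl, el), (vr, er) = analyze_expr(left, state), analyze_expr(right, state)
    if op == "+":
        value, error = i_add(vl, vr), i_add(el, er)
    elif op == "*":
        value = i_mul(vl, vr)
        error = i_add(i_add(i_mul(vl, er), i_mul(vr, el)), i_mul(el, er))
    else:
        raise ValueError("unsupported operator: " + op)
    return (value, i_add(error, rounding(value)))


def analyze_mir(mir, state):
    """Analyze every expression of the MIR against the same input state."""
    return {var: analyze_expr(expr, state) for var, expr in mir.items()}


sigma0 = {"a": ((0.0, 1.0), (0.0, 0.0)), "b": ((1.0, 2.0), (0.0, 0.0))}
mu0 = {"a": ("+", "a", "b"), "b": ("*", "a", 0.5)}
result = analyze_mir(mu0, sigma0)
print(result)
```

Under these assumptions the error bounds come out as ±2^-23 and ±2^-25. The printed figures 1.19209304 × 10^-7 and 2.98023259 × 10^-8 in the text are one binary32 step larger, which appears consistent with bounds that are themselves rounded outward.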
4.3.2 Composition Operator
The analysis of an expression e ? µ, where e ∈ SemExpr and µ ∈ MIR, is carried out in
two steps. Initially, given an input state σ♯, µ is analyzed using (4.21), and we write σ♯′
as the analyzed state. Then the expression e is analyzed for its accuracy as usual, using
σ♯′ as the input state. Equivalently, this procedure can be defined as:
\[
E_s [e \mathbin{?} \mu]\, \sigma^\sharp = E_s [e] \left( E_m [\mu]\, \sigma^\sharp \right).
\tag{4.22}
\]
4.3.3 Ternary Conditional Operator
A conditional expression is written as b ? e1 : e2, where b ∈ BExpr and e1, e2 ∈ SemExpr.
The truth value of the Boolean expression b determines whether e1 or e2 is evaluated to be the
resulting value of the expression. Correspondingly, in our accuracy analysis, we impose
a constraint defined by the Boolean expression b on the value ranges of input variables,
such that e1 is evaluated with the ranges of values satisfying the constraint, while e2 is
computed with ranges that violate the constraint.
For example, we analyze an expression (a < 0) ? (a − 0.1) : a in single-precision. Initially,
we assume the program state consists of a variable a, which is a floating-point value that
has no associated round-off error and is bounded by [−1, 10], that is:
\[
\sigma^\sharp_0 = [\, a \mapsto ([-1, 10], [0, 0]) \,].
\tag{4.23}
\]
We consider two cases, when the condition a < 0 is respectively true and false. For
a < 0 to be true, a must be in the range of v♯b = [−1, 0−], where 0− is the greatest
single-precision floating-point value less than 0, because a must be strictly smaller than 0.
The expression a − 0.1 is then analyzed with this restricted range, giving a value range v♯1
and an error bound e♯1.
Similarly, when a < 0 is false, we restrict the bound on a with v♯¬b = [0, 10] and analysis
of the expression a simply gives v♯2 = [0, 10] and no round-off error, e♯2 = [0, 0]. Finally, the
analyzed value and error ranges for the expression can be obtained by joining these two
cases together, by respectively evaluating v♯1 ⊔ v♯2, which produces v♯ = [−1.10000002, 10],
and the error bound e♯1 ⊔ e♯2. The final result is therefore:
\[
E_s [(a < 0) \mathbin{?} (a - 0.1) : a]\, \sigma^\sharp_0
= \left( [-1.10000002, 10],\; \left[ -5.81145372 \times 10^{-8},\; 6.10947666 \times 10^{-8} \right] \right).
\tag{4.25}
\]
Our analysis is based on interval arithmetic, which is very efficient but sacrifices accuracy
by computing an over-approximation of the exact results. For instance, the above analysis
cannot capture the fact that every evaluated result v satisfies either −1.1 ≤ v < −0.1
or 0 ≤ v ≤ 10. The majority of such information losses occur because IA computes
loose bounds on the analyzed floating-point results. In contrast, joining two error
bounds generally produces a precise bound, because error bounds are often much less
correlated, and they often overlap as there is a high chance that there exists an arithmetic
computation which produces an exact floating-point outcome. In Section 4.7, empirical
results show that although the accuracy analysis potentially produces loose bounds, it can
still be used effectively as an indicator of the round-off errors in actual executions.
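The split-and-join on value ranges, and the information it loses, can be sketched as follows. This Python fragment is only an illustration under stated assumptions: it tracks value ranges but not error bounds, it computes in binary64 without the outward rounding that yields −1.10000002 in (4.25), and the helper names (split_negative, analyze_conditional, join) are hypothetical.

```python
ZERO_MINUS = -(2.0 ** -149)  # greatest binary32 value strictly below 0


def join(p, q):
    """Smallest interval containing both p and q (None is the empty range)."""
    if p is None:
        return q
    if q is None:
        return p
    return (min(p[0], q[0]), max(p[1], q[1]))


def split_negative(value_range):
    """Split a range by the condition a < 0 into (true part, false part)."""
    lo, hi = value_range
    true_part = (lo, min(hi, ZERO_MINUS)) if lo < 0.0 else None
    false_part = (max(lo, 0.0), hi) if hi >= 0.0 else None
    return true_part, false_part


def analyze_conditional(value_range):
    """Value range of (a < 0) ? (a - 0.1) : a for a in value_range."""
    true_part, false_part = split_negative(value_range)
    v1 = None if true_part is None else (true_part[0] - 0.1, true_part[1] - 0.1)
    v2 = false_part
    return v1, v2, join(v1, v2)


v1, v2, joined = analyze_conditional((-1.0, 10.0))
print("true branch :", v1)
print("false branch:", v2)
print("joined      :", joined)
```

The joined interval contains values such as −0.05 that neither branch can produce, which is the over-approximation described above.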
We now provide a formal definition of the above example analysis. We use the notation
σ♯|b and σ♯|¬b, where σ♯ ∈ ΣE♯ is the program state, and b ∈ BExpr is the Boolean
expression, to respectively mean the program state σ♯ constrained by b being
true or false. Therefore, the following formula is used to perform the accuracy analysis
of a conditional expression:
\[
E_s [b \mathbin{?} e_1 : e_2]\, \sigma^\sharp = E_s [e_1]\, \sigma^\sharp|_{b} \;\sqcup\; E_s [e_2]\, \sigma^\sharp|_{\neg b}.
\tag{4.26}
\]
4.3.4 Fixpoint Operator
An expression with a fixpoint operator, i.e. a fix node with the children b, µs and x, has three
child nodes: the Boolean expression b ∈ BExpr, the loop body represented with a MIR µs ∈ MIR,
and the return variable x ∈ Var. Similar to executing a while loop, evaluating the expression
means iteratively evaluating b for its truth value: if b is true, then the loop MIR µs is used to
update the program state for the next iteration, and we repeat the process until b
evaluates to false.
Before we explain how a fixpoint expression can be analyzed for its accuracy, we introduce
the concept of a loop invariant (LI). In our context, an LI of a while loop is a set of bounds
on loop variables that holds invariantly on entry to each loop iteration. In Section 2.3 of
Chapter 2, we explain how the LI of a simple program loop can be computed; here we
further extend this concept to general programs expressed in MIRs.
For instance, consider the basel example in Figure 4.1. If our input is n = 10, then
the LI on the variable x is that its value is an integer bounded by [0, 9] on loop
entry, whereas on loop exit, x is always equal to 10. The reason for inferring the LI is as
follows. Since we optimize the fixpoint expression’s child nodes in a bottom-up hierarchy,
the optimization of µs precedes that of fix(b, µs, x) itself. Hence, we use the LI as the
input state to optimize µs, as the LI encompasses all possible program states the loop
body µs will encounter when executed.
Our accuracy analysis of fix(b, µs, x) follows the above pattern. Initially, we start with an
input program state σ♯0. The state σ♯0 is split into two disjoint parts, namely
σ♯0|b and σ♯0|¬b, which respectively satisfy and violate the Boolean constraint b. The state
σ♯0|b represents all possible program states that enter the loop body µs, so σ♯1 = Em[µs]σ♯0|b
captures all possible program states after the loop body. This procedure is repeated
for iterations k = 1, 2, 3, . . ., until a certain iteration n, where σ♯n = σ♯n−1. Hence, we
can obtain the LI by computing σ♯0|b ⊔ σ♯1|b ⊔ · · · ⊔ σ♯n|b, and the loop exit state with its
counterpart, i.e. σ♯0|¬b ⊔ σ♯1|¬b ⊔ · · · ⊔ σ♯n|¬b. Here, the loop exit state collects all possible
program states on loop exit. The ⊔ operator on states is analogous to joining intervals
and value-error bounds: it joins the value-error bounds of each variable in the two
states, and is defined as follows, where σ♯a, σ♯b ∈ ΣE♯:
σ♯a ⊔ σ♯b = [x ↦ σ♯a(x) ⊔ σ♯b(x)] x∈Var. (4.27)
Alternatively, we can compute the LI as the LFP of g : ΣE♯ → ΣE♯, where:






This LFP can be computed with the algorithm in Figure 4.4. The value ⊥ indicates
an empty or unreachable state, and for any state σ♯ ∈ ΣE♯, we have ⊥ ⊔ σ♯ = σ♯. The
return values σ♯LI and σ♯LE











σ♯ to signify the result σ♯LI.
Our method extends the iterative method previously explained in Section 2.3 of
Chapter 2 by evaluating not only the LI, but also the loop exit state. Because this iterative
process may not terminate for non-terminating loops, we introduce a parameter max_iter
to limit the number of iterations; if this limit is reached, the tool produces a warning
to indicate that the analysis may never terminate and thus may be inaccurate. Widening
operators [CC04] are additionally used to accelerate the fixpoint computation.
        ...
        σ♯LI ← σ♯LI ⊔ σ♯tt
        σ♯LE ← σ♯LE ⊔ σ♯ff
        σ♯k+1 ← Em[µs]σ♯tt
        if σ♯k+1 = σ♯k
        ...
        k ← k + 1
    end loop
end function
Figure 4.4. The accuracy analysis of a fixpoint expression.
4.4 Resource Usage Analysis
In this section we give a detailed explanation of how resources in MIRs can be shared
and how to analyze the resource utilization of MIRs.
Expressions can have common subexpressions, and eliminating them reduces resource
usage. We identify and eliminate common subexpressions when we construct DAGs
from programs. Resource statistics can be estimated by accumulating LUT and DSP
counts of each operator in the DAG. However, we can further merge multiple nodes
into one to reduce the estimated resource usage of generated code. Conventional
compiler optimizations [Kuc+] such as branch and loop fusion, as well as redundant
code elimination can be applied in an equivalent fashion to our MIRs to share conditional
and fixpoint operators. This process further reintroduces control structures to MIRs, so
that the code generation stage can make use of the result to synthesize code without
redundant control structures.
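As an illustration of the first two steps above, the sketch below eliminates common subexpressions while building a DAG and then accumulates per-operator costs. It is not SOAP's implementation: the tuple encoding of expressions, the function names, and in particular the LUT and DSP figures in `COST` are hypothetical placeholders, whereas real figures come from synthesis.

```python
# Hypothetical per-operator costs as (LUTs, DSPs).
COST = {"+": (500, 0), "*": (150, 4)}


def build_dag(expr, nodes):
    """Hash-cons `expr` (nested tuples or variable names) into the set `nodes`."""
    if isinstance(expr, str):
        return expr                       # variables use no operators
    op, lhs, rhs = expr
    key = (op, build_dag(lhs, nodes), build_dag(rhs, nodes))
    nodes.add(key)                        # identical subexpressions collapse
    return key


def resources(exprs):
    """Sum the costs of the distinct operator nodes across all expressions."""
    nodes = set()
    for e in exprs:
        build_dag(e, nodes)
    luts = sum(COST[op][0] for op, _, _ in nodes)
    dsps = sum(COST[op][1] for op, _, _ in nodes)
    return luts, dsps


# (a + b) * (a + b) and (a + b) * c share the single adder a + b.
shared = resources([("*", ("+", "a", "b"), ("+", "a", "b")),
                    ("*", ("+", "a", "b"), "c")])
print(shared)
```

Only one adder and two multipliers are counted, although the source expressions mention the addition three times.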
4.4.1 Sharing conditional expressions
In essence, sharing the conditional operator is equivalent to branch fusion. For
example, consider the program in Figure 4.5a, whose MA produces the
MIR in Figure 4.5b.





















Figure 4.5. The sharing of conditional expressions in a simple program.
Because we compute an abstraction of the program, the MIR does not keep the structure
of the if statement. Instead, each variable receives its own conditional expression so that
they can be optimized separately, which allows our optimization to produce more accurate
implementations. The resulting MIR of the program therefore consists of two conditional
expressions, as shown in Figure 4.5b. Because of this, after optimization, the MIR may
have duplicate control paths. To resolve this, we introduce new kinds of nodes, as shown
in Figure 4.5c, to “bundle up” more than one conditional expression when they all have
the same Boolean expression.
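The bundling idea can be sketched as a grouping of conditional expressions by their Boolean expression. Everything in this example is an assumption made for illustration: the tuple encoding `("?", cond, then, else)`, the function names, and the example MIR, whose two conditionals are independent of each other so that emitting them under one if statement is safe.

```python
from collections import defaultdict


def bundle_conditionals(mir):
    """Group conditional expressions by their condition; return (bundles, rest)."""
    bundles, rest = defaultdict(dict), {}
    for var, expr in mir.items():
        if isinstance(expr, tuple) and expr[0] == "?":
            _, cond, then_e, else_e = expr
            bundles[cond][var] = (then_e, else_e)
        else:
            rest[var] = expr
    return dict(bundles), rest


def emit(bundles):
    """Emit one if statement per bundle instead of one per variable."""
    lines = []
    for cond, arms in bundles.items():
        lines.append(f"if ({cond}) {{")
        lines += [f"    {v} = {t};" for v, (t, _) in arms.items()]
        lines.append("} else {")
        lines += [f"    {v} = {e};" for v, (_, e) in arms.items()]
        lines.append("}")
    return "\n".join(lines)


mir = {
    "x": ("?", "a < b", "x + 1", "x"),
    "y": ("?", "a < b", "y * 2", "y"),
    "z": "z",
}
bundles, rest = bundle_conditionals(mir)
code = emit(bundles)
print(code)
```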
4.4.2 Resource sharing in composition expressions
Composition expressions can be fused in an analogous manner. Any MIRs µ1 and µ2
can be merged to form a new MIR if there are no conflicts in the variable-expression
pairing, i.e. for any variable x that is assigned an expression in both µ1 and µ2, µ1(x) is
equal to µ2(x). A “bundle” of expressions can also be created to denote the sharing of the
composition operator. For example, a simple MIR in Figure 4.6a can be restructured by
fusing both variables x and y.
Figure 4.6. The sharing of composition expressions: (a) a simple MIR, (b) after fusion.
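The merge condition for two MIRs stated above amounts to a conflict check on the variables they both assign. The sketch below assumes a MIR is encoded as a dictionary from variable names to expression strings; `merge_mirs` and the example expressions are hypothetical.

```python
def merge_mirs(mu1, mu2):
    """Merge two MIRs if every shared variable has the same expression, else None."""
    for var in mu1.keys() & mu2.keys():
        if mu1[var] != mu2[var]:
            return None               # conflicting variable-expression pairing
    return {**mu1, **mu2}


mu1 = {"x": "x + 1"}
mu2 = {"x": "x + 1", "y": "z * (x + 1)"}
merged = merge_mirs(mu1, mu2)                  # x agrees, so the MIRs fuse
conflict = merge_mirs(mu1, {"x": "x + 2"})     # x disagrees, so they do not
print(merged, conflict)
```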
Resource sharing in fixpoint expressions
For fixpoint expressions that represent while loops, discovering resource sharing
opportunities is a little more complex. We fuse fixpoint expressions in a fashion analogous
to that described above for conditional expressions, which is related to the conventional
loop fusion compiler optimization. For instance, the program in Figure 4.7a has the MIR in
Figure 4.7b, where:
b = x > y, µ1 = [x ↦ x/2], and µ2 = [x ↦ x/2, z ↦ z + x]. (4.29)
During program optimization, the two fixpoint expressions are optimized individually,
because loop splitting could enhance the accuracy of both x and z. After this, in the
resulting MIR, fixpoint expressions may or may not share computations. If they share
common computations that can be merged without conflict, the fixpoint expressions can
often be fused together to save resources in the conditional expressions and the control
logic that were formerly duplicated. For instance, Figure 4.7c shows how two expressions
can be fused into one fixpoint expression.
while (x > y) {
    z = z + x;
    x = x / 2;
}
Figure 4.7. The sharing of fixpoint expressions in a simple example program: (a) the
program, (b) its MIR, (c) after fusion.
However, in some cases the above transformations could create cyclic graphs. To illustrate
this, consider the conditional expression fusion example in Figure 4.8a. By fusing the two
conditional operators a and b, a cycle is created, as shown in Figure 4.8b.
Figure 4.8. Example branch fusion creates a cycle: (b) fusing the two conditional
operators a and b creates a cycle.
Not only can conditional expression fusion create dependency cycles; all of the above
transformations, when performed in conjunction, also have the potential to generate
cycles. Such cycles in MIRs cannot be directly translated back into the source syntax.
In our example, only one of a and b can be fused without generating a dependency
cycle in the MIR. There is more than one way of restructuring an MIR, and each way could
produce a different MIR that generates a different program. It is therefore necessary to
be specific about the strategy to make this procedure deterministic. Our method explores
fusion opportunities with a depth-first traversal of the MIR. While traversing nodes, we
attempt to form the maximal sharing for the current node; when a cycle is created, the
algorithm backtracks to the previous acyclic version. This process is deterministic and
fixes the order in which resources are shared.
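The cycle check that guards each fusion step can be illustrated with a greedy simplification of the strategy above: candidate fusions are tried in a fixed order, and any fusion that makes the dependency graph cyclic is discarded in favour of the previous acyclic graph. This is an assumption-laden sketch rather than the depth-first traversal itself; the graph encoding, the node names and the helper functions are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def contract(deps, a, b):
    """Merge node b into node a in a graph given as {node: set of predecessors}."""
    rename = lambda n: a if n == b else n
    merged = {}
    for node, preds in deps.items():
        merged.setdefault(rename(node), set()).update(rename(p) for p in preds)
    return merged


def is_acyclic(deps):
    try:
        tuple(TopologicalSorter(deps).static_order())
        return True
    except CycleError:
        return False


def fuse_all(deps, candidates):
    applied = []
    for a, b in candidates:           # fixed order keeps the result deterministic
        trial = contract(deps, a, b)
        if is_acyclic(trial):
            deps = trial
            applied.append((a, b))
        # otherwise keep the previous acyclic graph
    return deps, applied


# b1 uses the result of a1, and a2 uses the result of b2.
deps = {"a1": set(), "b1": {"a1"}, "b2": set(), "a2": {"b2"}}
fused, applied = fuse_all(deps, [("a1", "a2"), ("b1", "b2")])
print(applied)
```

In this example fusing both pairs would make the two fused nodes depend on each other, so only the first fusion is kept.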
Resource counting
Finally, after applying the resource sharing transformations outlined above, we count
the number of LUTs and DSPs by accumulating the statistics for each operator instance
in the MIR, in the same way as described in Section 3.2 of Chapter 3. For this resource
estimation, we gather statistics of individual operators by synthesizing each arithmetic
operator (addition, subtraction, multiplication, division and comparison), for single-
and double-precision floating-point, using Altera’s floating-point megafunctions [Alt13].
Similarly, the resource usage of integer arithmetic, conditional and fixpoint operators
is synthesized and estimated from multiplexers, and the composition operator uses no
resources because it does not perform computations.
4.5 Equivalent Structure Analysis
The next step is to use the analyses of accuracy and resource usage of equivalent structures
in MIRs to efficiently discover optimized equivalent MIRs. In this section, the equivalence
relations from Section 3.3 of Chapter 3 are extended to control-flow structures, and we
guide this process efficiently with our analyses of accuracy and resource usage.
We start by taking one step further from simple arithmetic equivalence relations such
as associativity, commutativity and distributivity described in Section 3.3 of Chapter 3.
We define additional equivalence relations for composition, conditional and fixpoint
expressions to fully enable the equivalence transformations of MIRs. Then we go on to
improve the methods described in SOAP to explore more equivalent structures, but still
in an efficient and scalable way.
4.5.1 Equivalence Relations
In Chapter 3, the SOAP framework formally defines a set of equivalence relations:
≡ ⊂ AExpr × AExpr, (4.30)
for discovering equivalent arithmetic expressions. This equivalence relation ≡ consists
of arithmetic equivalence rules such as associativity, commutativity and distributivity,
as well as reduction rules that propagate constants and simplify expressions. We now
expand it by making ≡ a subset of M × M:
≡ ⊂ M × M, where M = SemExpr ∪ MIR, (4.31)
by defining additional rules for ≡, which relate equivalent semantic expressions that
make use of the additional operators introduced in this chapter.
For the following rules, we assume b, b1, b2 ∈ BExpr, e, e1, e2 ∈ SemExpr, µ ∈ MIR,
and ⊗ ∈ {+, −, ×, /}, and each rule also has an inverse version.
Boolean Operators
Because Boolean expressions are introduced in this chapter, we also make use of standard
Boolean equivalences, which include commutativity b1 ⊗ b2 ≡ b2 ⊗ b1, associativity
(b1 ⊗ b2) ⊗ b3 ≡ b1 ⊗ (b2 ⊗ b3), distributivity (b1 ∨ b2) ∧ b3 ≡ (b1 ∧ b3) ∨ (b2 ∧ b3), and de
Morgan’s laws, where b1, b2, b3 are Boolean expressions, ⊗ ∈ {∨, ∧}, and ∨ and ∧ are the
Boolean or and and operators respectively. Finally, reduction rules are also used to simplify
Boolean expressions, which include contradiction b ∧ ¬b ≡ ff, tautology b ∨ ¬b ≡ tt, and
constant propagation.
Ternary Conditional Operator
The ternary conditional operator has the following distributivity rules. Firstly, for unary
and binary arithmetic operators:
− (b ? e1 : e2) ≡ b ?−e1 :−e2,
e⊗ (b ? e1 : e2) ≡ b ? (e⊗ e1) : (e⊗ e2) ,
(b ? e1 : e2)⊗ e ≡ b ? (e1 ⊗ e) : (e2 ⊗ e) .
(4.32)
Secondly, we can distribute over the conditional operators:
b1 ? (b2 ? e1 : e2) : e ≡ b2 ? (b1 ? e1 : e) : (b1 ? e2 : e) ,
b1 ? e : (b2 ? e1 : e2) ≡ b2 ? (b1 ? e : e1) : (b1 ? e : e2) .
(4.33)
Thirdly, we introduce additional reduction rules to simplify ternary conditional expres-
sions if possible:
tt ? e1 : e2 ≡ e1, ff ? e1 : e2 ≡ e2. (4.34)
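The rules (4.32) to (4.34) can be checked exhaustively on a small domain. The sketch below does so over a few integers, where arithmetic is exact; it is only a sanity check in real arithmetic, it says nothing about round-off behaviour, and division is left out to avoid dividing by zero. The helper names are hypothetical.

```python
from itertools import product
from operator import add, mul, sub


def cond(b, e1, e2):
    """The ternary conditional b ? e1 : e2."""
    return e1 if b else e2


def check_rules(values=(-2, 0, 3)):
    """Assert the rules on every combination; return the number of cases checked."""
    cases = 0
    for b1, b2 in product([False, True], repeat=2):
        for e, e1, e2 in product(values, repeat=3):
            # (4.32): distribution of unary and binary arithmetic operators
            assert -cond(b1, e1, e2) == cond(b1, -e1, -e2)
            for op in (add, sub, mul):
                assert op(e, cond(b1, e1, e2)) == cond(b1, op(e, e1), op(e, e2))
                assert op(cond(b1, e1, e2), e) == cond(b1, op(e1, e), op(e2, e))
            # (4.33): distribution over nested conditionals
            assert (cond(b1, cond(b2, e1, e2), e)
                    == cond(b2, cond(b1, e1, e), cond(b1, e2, e)))
            assert (cond(b1, e, cond(b2, e1, e2))
                    == cond(b2, cond(b1, e, e1), cond(b1, e, e2)))
            # (4.34): reduction rules
            assert cond(True, e1, e2) == e1 and cond(False, e1, e2) == e2
            cases += 1
    return cases


print(check_rules())
```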
Composition Operator
Similarly, the composition operator can also be distributed across arithmetic and condi-
tional operators:
(−e) ? µ ≡ − (e ? µ) ,
(e1 ⊗ e2) ? µ ≡ (e1 ? µ)⊗ (e2 ? µ) ,
(b ? e1 : e2) ? µ ≡ (b ? µ) ? (e1 ? µ) : (e2 ? µ) .
(4.35)
Fixpoint Operator
The fixpoint operator represents loops, and loops can be partially unrolled. Similarly,
we can define a set of equivalence relations to perform partial unrolling on fixpoint
expressions:
fix(b, µ, x) ≡ fix(b, p_µ^k(µ), x) for k ∈ ℕ. (4.36)
Here x ∈ Var and p_µ is a function that computes the unrolling of the loop MIR µ, where
p_µ(µ′) = [x ↦ (b ? µ) ? (µ′(x) ? µ) : µ(x)] x∈Var. (4.37)
This set of rules formally defines the partial unrolling of loops with a certain number
of steps k. Using the rules to unroll the loop “while (b) {s}”, where b ∈ BExpr and
s ∈ Stmt, is equivalent to unrolling the loop syntactically. As an example, we
consider the cases k = 0, 1, 2 in Figure 4.9.
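The syntactic view of partial unrolling can be sketched as follows: each unrolled copy of the loop body is guarded by the loop condition, so the program behaves identically for any unroll depth. The loop used here (repeated multiplication by 0.9) and the function `run` are assumptions chosen for illustration and are not taken from Figure 4.9.

```python
def run(x, unroll):
    """Run `while (x > 1) x = 0.9 * x` with `unroll` extra guarded body copies."""
    while x > 1.0:
        x = 0.9 * x
        # Each extra copy is guarded by the loop condition. Once a guard fails
        # the state no longer changes, so this flat sequence of guards behaves
        # like the nested form `if (b) { s; if (b) { s } }`.
        for _ in range(unroll):
            if x > 1.0:
                x = 0.9 * x
    return x


inputs = [0.5, 1.0, 3.7, 20.0]
results = [(run(x, 0), run(x, 1), run(x, 2)) for x in inputs]
print(results)
```

All three depths execute the same sequence of floating-point operations, so the results agree exactly; differences in accuracy only arise once the unrolled body is restructured by the arithmetic rules.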
Figure 4.9. A transformed while loop with different partial loop unroll depths.
The next set of rules in (4.38) extends the structural induction rules for arithmetic
expressions in (3.21) of Chapter 3 to inductively discover equivalent structures in child
expressions. In these rules, recall that the formulae above the line are the premises,
whereas the ones below the line are conclusions. If the premises are true, then the
conclusion must be true. For instance, in a conditional expression x < y ? (x + y) + z : x,
because the subexpression (x + y) + z is equivalent to x + (y + z), it also has an equivalent
expression x < y ? x + (y + z) : x. We define similar rules respectively for the conditional,
fixpoint and composition operators, and finally we also have an analogous rule for MIRs,
which formalizes that if every subexpression µ(x) of an MIR µ has an equivalent
expression ex, then the MIR also has an equivalent MIR obtained by replacing each
expression µ(x) with ex.
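\[
\frac{b \equiv b' \quad e_1 \equiv e_1' \quad e_2 \equiv e_2'}
     {b \mathbin{?} e_1 : e_2 \;\equiv\; b' \mathbin{?} e_1' : e_2'},
\qquad
\frac{b \equiv b' \quad \mu \equiv \mu'}
     {\mathrm{fix}(b, \mu, x) \;\equiv\; \mathrm{fix}(b', \mu', x)},
\]
\[
\frac{e \equiv e' \quad \mu \equiv \mu'}
     {e \mathbin{?} \mu \;\equiv\; e' \mathbin{?} \mu'},
\qquad
\frac{\forall x \in \mathrm{Var} : \exists e_x \in \mathrm{SemExpr} : (\mu(x) \equiv e_x)}
     {\mu \;\equiv\; [x \mapsto e_x]_{x \in \mathrm{Var}}}.
\qquad (4.38)
\]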
4.5.2 Discovering Equivalent Structures Efficiently
As we discussed earlier, discovering the full set of equivalent expressions by finding
the transitive closure of the relations is infeasible because of combinatorial explosion.
Chapter 3 proposed a method to drastically reduce the space and time complexity of
discovering equivalent expressions, while achieving high quality optimizations.
We base our equivalent expression discovery on this method, but extend it to support
not only simple arithmetic expressions but also the additional program transformation
features proposed in this chapter, i.e. conditional, composition and fixpoint expressions,
and MIRs. This enables full program transformations.
This section provides an informal overview of the equivalence discovery procedure. In
Appendix B, we discuss the formal definition of this procedure for all semantic expressions
and MIRs by extending the optimization function O[e]σ♯ proposed in Section 3.3 of
Chapter 3.
Ternary Conditional Operator
For conditional expressions of the form b ? e1 : e2, we first optimize the Boolean expression
b. Then we use each equivalent expression of b to constrain the program state for the
optimization of e1 and e2. After optimizing child nodes, we combine the equivalent
expressions of these three child nodes to form a set of equivalent conditional expressions.
We further discover additional equivalent expressions from this set, and finally keep only
those that are Pareto-optimal.
Composition Operator
Expressions of the form e ? µ are optimized by first finding equivalences of µ. Then,
for each equivalent MIR discovered, we compute a new program state σ♯0 = Em[µ]σ♯ and use
it to optimize e. Finally, all optimized combinations of e and µ form a complete set of
expressions equivalent to e ? µ, and Pareto-suboptimal candidates are pruned from this
set.
Fixpoint Operator
Fixpoint expressions are optimized in three steps. Initially, partially unrolled versions are
discovered using the rules defined in (4.36); the expressions are partially unrolled by
a factor of up to 3. Then, for each unrolled version, we obtain its loop invariant using the
algorithm illustrated in Figure 4.4, and the loop invariant is then used to optimize its child
nodes, which are the Boolean expression and the loop MIR. Finally, a set of equivalent
fixpoint expressions can be derived by combining equivalent child nodes together, and
we eliminate Pareto-suboptimal fixpoint expressions from this set.
MIR
MIRs can be optimized by first optimizing their expressions individually, then constructing
a set of MIRs by enumerating all combinations of optimized expressions for each variable.
The size of the final set is then reduced by pruning Pareto-suboptimal candidates.
However, because an MIR consists of multiple expressions and each has its own accuracy,
we need a strategy to quantify the accuracy of an MIR. In Section 3.3.5 of Chapter 3,
we discuss the reasons for this requirement, and various strategies to optimize multiple
expressions, which are readily extensible to MIRs.
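Each of the procedures above ends by pruning Pareto-suboptimal candidates. A minimal sketch of that pruning step follows, assuming each candidate is summarized by a tuple (error bound, LUTs, DSPs) in which smaller is better in every dimension; the function name and the candidate values are hypothetical.

```python
def pareto_frontier(candidates):
    """Keep the candidates that no other candidate dominates."""
    def dominates(p, q):
        # p dominates q if it is no worse everywhere and differs somewhere.
        return p != q and all(a <= b for a, b in zip(p, q))

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [
    (7.2e-06, 241, 4),     # cheapest
    (5.9e-06, 932, 16),    # most accurate
    (7.5e-06, 300, 4),     # dominated by the first candidate
    (5.9e-06, 1000, 16),   # dominated by the second candidate
]
frontier = pareto_frontier(candidates)
print(frontier)
```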
4.6 Code Generation
The final stage is to translate the optimized MIR back to a program in its original syntax.
As discussed earlier, the MA produces an abstraction of the program, which means there
are generally many ways of generating different programs from the same MIR. For this
reason, certain heuristic optimizations are performed before or during code generation,
such as branch- and loop-fusion transformations explained in our resource usage analysis
to produce a unique and deterministic translation from the MIR.
Our code generation is carried out in three stages. The first stage applies the transforma-
tions outlined in Section 4.4 to perform loop- and branch-fusion, and to allow sharing of
expressions across nested MIRs.
After the first stage, we create a topological sort of all nodes, which produces a linear
ordering of all nodes such that the control- and data-dependences are preserved. We
then perform a simple one-to-one mapping from the list of nodes to program code. An
arithmetic node is translated into an assignment statement which assigns the result of
the arithmetic computation to a temporary variable. A ternary conditional and a fixpoint
node respectively generate an if statement and a while loop. Finally, a composition
node e ? µ ensures that µ is generated before e.
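For arithmetic nodes, the topological sort and the one-to-one mapping to assignments can be sketched as below. The DAG encoding, the temporary names `t0` to `t2` and the `generate` function are hypothetical, and control-flow nodes are not handled in this sketch.

```python
from graphlib import TopologicalSorter

# node -> (operator, left operand, right operand), for (a + b) * (a + b) - c
dag = {
    "t0": ("+", "a", "b"),
    "t1": ("*", "t0", "t0"),
    "t2": ("-", "t1", "c"),
}


def generate(dag):
    """Emit one assignment per node in an order that respects data dependences."""
    # Only operands that are themselves nodes create dependences.
    deps = {n: {o for o in ops if o in dag} for n, (_, *ops) in dag.items()}
    order = TopologicalSorter(deps).static_order()
    return [f"{n} = {dag[n][1]} {dag[n][0]} {dag[n][2]};" for n in order]


code = generate(dag)
print("\n".join(code))
```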
The final and optional step of our code generation is to perform code sinking, which
moves parts of the code so that they are not executed when their results are not
needed [LA]. For instance, the result of the statement “y = x + 1;” is only used in the
true-branch of the program:
y = x + 1;
if (x < 1)
    y = y * 2;
else
    y = x;
This statement can therefore be moved into the true-branch of the if statement, so
that it is evaluated only when needed. This final step further allows HLS tools to
apply if-conversion to revert the changes if the tool believes this improves the resulting
circuit.
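The effect of code sinking on this example can be written out explicitly. The two functions below are a transcription into Python, made for illustration, of the program before and after the statement is moved; they return the same value for every input.

```python
def before(x):
    y = x + 1          # always evaluated, but only used when x < 1
    if x < 1:
        y = y * 2
    else:
        y = x
    return y


def after(x):
    if x < 1:
        y = x + 1      # sunk into the only branch that uses it
        y = y * 2
    else:
        y = x
    return y


same = all(before(x) == after(x) for x in range(-5, 6))
print(same)
```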
4.7 Evaluation
In this section we use SOAP to optimize a benchmark suite of six numerical programs.
The floating-point values in these examples use the IEEE 754 32-bit single-precision
format with round-to-nearest mode. The benchmark consists of two introductory
numerical programs, and four real applications that are frequently encountered in
numerical analyses, where round-off errors could have a large impact on the quality of
their execution. Appendix C contains the source code of all benchmark examples below.
• simple: for an input x ∈ [0, 20], we repeatedly multiply it by 0.9 until the result
is less than or equal to 1. The number of iterations depends on x and can only
be determined by analyzing the program.
• basel: the example in Figure 4.1.
• taylor: the Taylor expansion of cos(x + y), with single-precision inputs x ∈
[−0.1, 0.1] and y ∈ [0, 1], and an integer n ∈ [10, 20], which determines the bound
on the iteration count.
• filter: computes the unit step response of a 3rd-order IIR filter, where inputs
are bounded by [0, 1] and all coefficients are bounded by [0, 0.2], with a fixed
iteration count of 20.
• euler: uses Euler’s method to solve the differential equation of a harmonic
oscillator ẍ + ω²x = 0, with both an initial stationary position x and ω² bounded by
[0.0, 1.0], a step size of 0.1, and an iteration count n ∈ [0, 20]. It returns the position
x and velocity ẋ.
• pid: the example proportional–integral–derivative (PID) controller that was used as
a case study in [Dam+14] to motivate automated accuracy optimization of
numerical programs. We make it more challenging by changing the constant coefficients
into bounds, so as to model not one but a large selection of PID controllers, with
kp ∈ [9.0, 10.0], ki ∈ [0.5, 0.7] and kd ∈ [0, 3], and an iteration count n ∈ [0, 20].
Our configuration for the efficient discovery of equivalent expressions uses a depth limit
k = 2, and D = 3 as the maximum partial loop unrolling depth. We optimize each
program with SOAP to discover a wide range of implementations, and
select the most accurate and the least resource demanding equivalent implementations for
further analysis.
Our code generator produces C source code from the optimized MIRs, which is then
synthesized with LegUp [LU], with floating-point operator sharing turned off to achieve
maximum frequency, then compiled and verified with Quartus [Alt10], targeting an Altera
Stratix IV device (EP4SGX530) [Alt16] for the actual resource usage statistics and the
frequency achieved.
In order to provide evidence that the minimization of the statically analyzed round-
off errors strongly correlates with the reduction of actual errors encountered during
computation in circuits, the selected optimization candidates are also simulated for
numerical accuracy. This simulation is carried out 1000 times with 1000 uniformly
distributed random inputs for each benchmark example, then we compare the maximum
round-off errors encountered.
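A simulation of this kind can be sketched in a few lines. The example below is not the thesis's test harness: it compares two equivalent orderings of a three-term sum, emulates single precision with `struct`, uses exact rationals as the reference, and draws inputs from ranges chosen arbitrarily for illustration.

```python
import random
import struct
from fractions import Fraction


def f32(v):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def max_error(impl, exact, inputs):
    """Maximum absolute round-off error of `impl` against an exact reference."""
    return max(abs(Fraction(impl(*args)) - exact(*args)) for args in inputs)


# Adding in double precision and then rounding to single precision gives the
# correctly rounded single-precision sum of two single-precision operands.
left = lambda a, b, c: f32(f32(a + b) + c)     # (a + b) + c
right = lambda a, b, c: f32(a + f32(b + c))    # a + (b + c)
exact = lambda a, b, c: Fraction(a) + Fraction(b) + Fraction(c)

rng = random.Random(0)
ranges = [(0.0, 1e-3), (0.0, 1.0), (0.0, 100.0)]
inputs = [tuple(f32(rng.uniform(lo, hi)) for lo, hi in ranges)
          for _ in range(1000)]
errors = {
    "(a+b)+c": max_error(left, exact, inputs),
    "a+(b+c)": max_error(right, exact, inputs),
}
print({name: float(err) for name, err in errors.items()})
```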
4.7.1 Results
Table 4.1 shows the results of optimizing the above benchmark examples. Here we
explain the meaning of each row and column.
The rows labeled “FR” and “MA” respectively show the statistics for the most resource
efficient, and the most accurate implementations.
The column “Time (s)” shows the time required for the optimization. The optimization
runtime is longer than that of a typical compiler optimizer, because a large number of
equivalent structures are examined during equivalent structure discovery.
The column “PF” shows for each benchmark example the total number of trade-off options
in the Pareto frontier. A larger number indicates more choices for trading off accuracy
and area.
The values shown in the “Error Bound” column are the maximum absolute errors found
by SOAP, while “Simulation” shows the actual absolute value bound on round-off errors
found during simulation, and the percentage shows the actual accuracy improvement
found in simulation of “MA” over “FR”. For each of these benchmark problems, the
optimization for accuracy correlates with the reduction of simulated round-off
errors in actual executions. They all show improvements in accuracy over the original,
by up to 65% in actual execution. The analyzed round-off errors are larger than the
simulated bounds, because our efficient accuracy analysis over-approximates the worst-
case round-off errors.
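The percentage column can be recomputed from the simulated error bounds as the relative reduction of “MA” over “FR”. The short check below uses the five FR and MA pairs of simulated errors that appear in Table 4.1; the variable names are assumptions made for the example.

```python
# (FR, MA) simulated error bounds from the rows of Table 4.1 that list both.
simulated = [
    (6.8728e-06, 5.8262e-06),
    (3.2456e-07, 1.4871e-07),
    (3.5672e-08, 2.077e-08),
    (1.0960e-07, 6.7255e-08),
    (2.4342e-07, 8.549e-08),
]

# Relative reduction of the simulated error, in percent.
improvement = [round(100 * (fr - ma) / fr, 1) for fr, ma in simulated]
print(improvement)
```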
The “Resources” columns show our estimation of the number of LUTs and DSPs required
for each of these programs, and the numbers in brackets are the corresponding statistics
obtained from Quartus synthesis.
Table 4.1. Table of optimization results.
      Error Bound   Simulation            LUTs             DSPs         Fmax (MHz)
FR    7.1786E-06    6.8728E-06            241 (415)        4 (4)        386.3
MA    5.8827E-06    5.8262E-06   15.2%    932 (1264)       16 (16)      318.9
FR    2.9340E-06    3.2456E-07            3988 (4081)      0 (0)        306.3
MA    1.1733E-06    1.4871E-07   54.2%    28648 (29136)    0 (0)        279.0
FR    2.0482E-07    3.5672E-08            1596 (2116)      24 (24)      308.7
MA    1.5756E-07    2.077E-08    41.8%    6661 (7489)      112 (112)    174.7
FR    7.5396E-07    1.0960E-07            3470 (4529)      24 (24)      289.4
MA    5.2858E-07    6.7255E-08   38.6%    21041 (26190)    104 (104)    172.0
FR    9.0607E-05    2.4342E-07            1532 (2362)      12 (12)      327.7
MA    4.8777E-05    8.549E-08    64.9%    24314 (32637)    128 (128)    187.1
FR    5.9869E-04    7.4389E-07            3975 (5576)      24 (24)      245.4

























Because our benchmark examples are designed to be resource efficient, there is no
room for resource usage optimization of the original program. However, we are able
to consistently reduce the resource usage of a plain partial loop unrolling by more
than 25%, because our optimization can discover subexpression sharing opportunities,
propagate constant values, and also aggressively reduce the size of expressions with
powerful reduction rules such as e − e = 0 and 0 × e = 0.
Besides the choices of implementations that are either most accurate or most resource
efficient, each optimization also offers a wide selection of optimized programs on the
Pareto frontier. For instance, Figure 4.10a shows the Pareto frontier of euler, which
has 26 different trade-off options. Furthermore, for euler, our optimization not only
identifies that computing the two return variables in the same loop is resource efficient,
but also, by optimizing the accuracy of the two variables individually, produces a
program with two loops of completely different structures, each computing its respective
return variable as accurately as possible. With this, we further widen the trade-off curve,
with the most accurate option improving the accuracy by 65%. In Figure 4.10b, because
the loop kernel of filter has the expression:
\[ \sum_{i=0}^{2} \left( a_i y_{i+1} + b_i x_i \right), \qquad (4.39) \]
which has a large number of equivalent expressions, our optimization improves its
accuracy by 14.5% even without increasing the resource usage.
Moreover, points within the shaded region are Pareto-optimal as they optimize DSP count,
since our Pareto frontier has three dimensions, namely accuracy, LUT
utilization and the number of DSPs.
4.7.2 Quality of Resource Estimation
The Pareto frontier is sensitive to how the quality metrics of each point (e.g. LUTs) compare
against each other (i.e. the rank), while the actual values of the metrics are irrelevant.
The Pareto frontier is therefore unaffected if the quality metrics are perturbed by a small
error such that the rank is unchanged. To ensure that the resource estimation method
used in our optimization can accurately identify whether an actual implementation is on
the Pareto frontier, we gathered 150 implementations discovered across the benchmark
examples, and for each one we rank its number of estimated LUTs among them, and
do the same for actual LUTs. Figure 4.11 plots, for each implementation, the rank of
estimated LUTs against the rank of actual LUTs, which shows that the ranking produced
by our resource estimation is very close to that of the Quartus-synthesized circuits.
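The rank comparison can be illustrated on a much smaller sample than the 150 implementations used for Figure 4.11. The sketch below takes the eleven estimated and synthesized LUT counts that appear in Table 4.1 and computes their Spearman rank correlation; the choice of this statistic and the helper names are assumptions made for the example, not part of the evaluation itself.

```python
def ranks(values):
    """Rank of each value within its list (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation coefficient for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Estimated and Quartus-synthesized LUT counts from Table 4.1.
estimated = [241, 932, 3988, 28648, 1596, 6661, 3470, 21041, 1532, 24314, 3975]
actual = [415, 1264, 4081, 29136, 2116, 7489, 4529, 26190, 2362, 32637, 5576]
rho = spearman(estimated, actual)
print(round(rho, 4))
```

A coefficient close to 1 means that the estimate orders the implementations almost exactly as synthesis does, which is what matters for Pareto-optimality.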
4.8 Summary
A new method is proposed and implemented in SOAP that performs general numerical
program transformation for the trade-off optimization among accuracy and two
resource-related metrics, LUT and DSP utilization. To optimize a numerical program, it
starts by abstracting the program into an MIR, which we designed to extract the essence of
executing the program, and which removes unnecessary information such as temporary
variables and the interleaving of non-dependent statements. The MIR is then optimized
efficiently by discovering a wide range of equivalent trade-off implementations of the
original MIR. An optimized MIR can then be chosen to be translated into a numerical
program. By using SOAP, we optimize the accuracy of our sample applications by up to
65% in actual program executions.
In Section 4.7, our experiments show that accurate implementations of numerical
algorithms often increase their number of arithmetic operations for reduced round-off
errors. In general, this could result in circuits that have longer wall-clock execution times.
However, with a greater area budget, we could have greater freedom in rewriting programs
to have greater throughput. It is also observed that even a small program requires a long
optimization time, which prevented us from exploring partial loop unrolling depths
greater than 3, and hindered better accuracy improvements. In addition, the optimized
programs may require a substantial increase in resource budget, because they do not
share resources among different clock cycles. This will be addressed in the following
chapter by introducing a new resource estimation analysis to allow operations to be
shared temporally. In the next chapter, we therefore propose refinements for a faster
optimization to explore deeper loop unrolling, and explore how to optimize numerical
programs consisting of pipelined loops to trade off accuracy, resources and latency.






































Figure 4.10. The Pareto frontier.





















Figure 4.11. The quality of resource estimation.
5 Accurate and Resource Efficient Pipelining of Numerical Programs
Numerical C programs typically spend most of their time in loops. For this reason, HLS
tools adopt state-of-the-art polyhedral compilation techniques [Can+14] to synthesize
loops to run as fast as possible. This is achieved by pipelining them to maximally exploit
parallelism across loop iterations. Conventional program equivalences (e.g. partial loop
unrolling and array access pattern changes) are ubiquitous in their compilation process.
However, their ability to perform pipelining, even with any combination of these
equivalences, is fundamentally constrained by data-dependences that are carried across
iterations, i.e. inter-iteration dependences. To relax these constraints, we must use
equivalence rules in real arithmetic (e.g. associativity and distributivity), in tandem with
the conventional rules above, to enable much more efficiently pipelined RTL designs. A
simple example of this is the summation of all elements in an array:
float sum = 0;
for (int i = 0; i < N; i++)
    sum += a[i];
This code can be partially unrolled and the sequence of additions can be rewritten using
tree adders to reduce its latency, but we will see later in Section 5.6 that more efficient
implementations are possible.
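The difference between the sequential summation and a tree-adder rewrite can be sketched as follows. This is only an illustration of the dependence structure, written with hypothetical function names; it does not model pipelining or hardware latency.

```python
def sequential_sum(a):
    """Left-to-right summation: every addition waits for the previous one."""
    total = 0.0
    for v in a:
        total += v
    return total


def tree_sum(a):
    """Pairwise reduction; returns (sum, depth of the addition tree)."""
    level, depth = list(a), 0
    if not level:
        return 0.0, 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            pairs.append(level[-1])   # odd element passes through unchanged
        level, depth = pairs, depth + 1
    return level[0], depth


a = [float(i) for i in range(1, 9)]
print(sequential_sum(a), tree_sum(a))
```

For these eight small integers both orders give the same value, but the tree needs only three dependent additions instead of eight. For general floating-point inputs the two orders may round differently, which is why accuracy has to be analyzed alongside latency.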
In contrast to the expression balancing optimization pass in VHLS, the new SOAP in this
chapter automatically produces results that are significantly better than manually tuning
partial unrolling factors and expression balancing #pragmas in VHLS, because it is fully
aware of how data-dependences are carried across iterations, and uses this to steer the
optimization process. SOAP is also conscious of the impact these transformations could
have on round-off errors, and minimizes them in the optimization process, as we treat
numerical accuracy as one of the three simultaneous objectives. Furthermore, VHLS only
generates one result, which does not necessarily improve over the original code.
The technical work presented in Chapters 3 and 4 lays the necessary foundation for the
new methods proposed in this chapter. Firstly, we exploit the SOAP framework’s ability
to analyze the numerical accuracy of a given program. Secondly, the framework provides
the basis for resource analysis, as common subexpression sharing can be detected by the
method detailed in Section 4.4 of Chapter 4. Thirdly, we can make use of MIRs to explore
program rewrites. Finally, the efficient algorithms for equivalent program discovery in
SOAP can be readily used.
In previous chapters, we only analyze structural reuse in the form of common
subexpressions. HLS tools further allow certain arithmetic operations to be shared
temporally. In this chapter, we therefore additionally analyze resource utilization by
considering the implications of sharing the same resources among different clock cycles.
For this purpose, we use a variant of the first few steps shared by modulo SDC
scheduling [Can+14] and the iterative modulo scheduling (IMS) algorithm [Rau94], which
efficiently analyzes the run time and resource utilization of a given program by computing
fundamental lower bounds of these metrics.
SOAP is evaluated on a suite of 11 programs from the Livermore Loops [DL11] and
PolyBench [Pou] benchmark suites. Our tool obtained a wide selection of Pareto-optimized
programs. Programs with the best latency obtained speedups of up to 12× (7× on average
across the suite), and increases in accuracy of up to 7× (2.7× on average), while using up
to 4× (2.5× on average) more LUTs. We were unable to decrease the resource utilization
in any of the benchmarks, as they have no redundant computations.
The contributions of this chapter include:
• An extended suite of program equivalence rules, which introduces access reduction
rules that remove extraneous array accesses. This chapter provides evidence that
150 Chapter 5 Accurate and Resource Efficient Pipelining of Numerical Programs
standard program equivalence techniques that do not affect program behavior,
e.g. partial loop unrolling and access reduction rules, can give rise to the freedom
for non-standard transformation rules, e.g. arithmetic rules, to significantly impact
latency, resource usage and accuracy in a loop (Section 5.4.2).
• A new scheduling analysis that estimates the latency and resource usage of a given
optimized candidate (Section 5.5). The resource usage analysis not only identifies
common subexpressions, but also opportunities to share arithmetic operations
temporally.
• A significantly faster discovery of equivalent programs, achieved through a faster
accuracy analysis that analyzes only a fraction of loop nest executions and speeds up the analysis
of loop nests without inter-iteration dependences (Section 5.5.3), graph partitioning,
and intelligent pruning of optimization candidates (Section 5.4.1).
• Incorporating the above-mentioned techniques, SOAP is now capable of automatically
and safely producing optimized programs (and subsequent RTL implementations
with VHLS) on the three-dimensional Pareto frontier of options that trade off
run time, accuracy, and area. Its improvements in latency are notably better than
those produced by VHLS’s unsafe optimizations alone. SOAP is further evaluated
on a suite of Livermore Loops and PolyBench benchmarks (Section 5.6).
This chapter, which is a natural extension to previous chapters, is organized as follows.
Section 5.1 details how a simple numerical program can be optimized to run efficiently
as our motivating example. Our automatic optimization process consists of three major
steps. It starts with metasemantic analysis, which takes as input the
original numerical program written in C, and translates it into a MIR. Section 5.3 explains
how this process is extended to multi-dimensional arrays. We then discover equivalent
MIRs using our efficient equivalent program discovery procedure, which produces a
Pareto frontier of optimized MIRs. Section 5.4 discusses the improvements made to this
procedure to further increase its performance. This process in turn makes use of the two
new performance analyses in Section 5.5, which respectively estimate the latency and
resource utilization. This section further explains how round-off errors can be bounded in
a given program with arrays. The optimized C programs can then be generated from the
MIRs, using the code generation routines from Chapter 4, to be synthesized in VHLS to
obtain RTL implementations. In Section 5.6 we evaluate the results of optimizing a suite
of benchmark examples extracted from PolyBench [Pou] and Livermore Loops [DL11].
Figure 5.1. An overview of our automatic program optimization process. The shaded region
shows our internal tool flow.
5.1 Motivation
Figure 5.2 gives an implementation of the Seidel stencil computation, extracted from
PolyBench [Pou], where initially all values in the array A are single-precision floating-point
values between 0 and 1. It resembles the typical code frequently used in fluid dynamic
simulations for solving partial differential equations and systems of linear equations.
#define N 1024
for (int t = 0; t < 20; t++)
    for (int i = 1; i < N-1; i++)
        for (int j = 1; j < N-1; j++)
            A[i][j] = 0.2 * (A[i-1][j] + A[i][j-1] +
                A[i][j] + A[i][j+1] + A[i+1][j]);
Figure 5.2. An excerpt from the Seidel stencil [Pou]. The inter-iteration data-dependence of the
innermost loop is underlined (A[i][j] and A[i][j-1]).
We start by synthesizing this program in VHLS. We enable loop pipelining in VHLS, which
asks it to optimize the loop by overlapping its iterations. However, we can observe that
this program has very limited opportunity for pipelining, because each iteration j of the
innermost loop ends by writing to A[i][j], and the next iteration j+1 begins by reading
from A[i][j]; this inter-iteration dependence is highlighted in Figure 5.2. Hence, it
serves as our example to motivate a better SOAP to efficiently pipeline loops.
VHLS generates a schedule where the depth of the loop D is 49, and II as enforced by the
data-dependences above is 46. The trip count of the innermost loop is N = 1022. The
overall latency of the innermost loop is therefore ((N − 1) × II) + D = 47,015 cycles.
We then enable VHLS’s expression balancing (EB) optimization. When synthesized, this
optimization pass tries to reorder the sequence of additions in the loop body into a tree
structure, thus reducing the II to 28 cycles, and the depth D to 42 cycles, while the trip
count N = 1022 remains the same. The overall latency is now ((N − 1) × II) + D = 28,630
cycles. The overall resource usage remains roughly the same.
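The two latency figures above follow directly from the formula ((N − 1) × II) + D. The helper below is a small sketch for reproducing them; the function name is chosen here for illustration and is not part of SOAP or VHLS.

```c
/* Latency in cycles of a pipelined loop: the first iteration takes
 * `depth` cycles, and each of the remaining trip_count - 1 iterations
 * starts `ii` cycles after the previous one. */
long pipelined_loop_latency(long trip_count, long ii, long depth)
{
    return (trip_count - 1) * ii + depth;
}
```

Calling it with (1022, 46, 49) gives the latency without expression balancing, and with (1022, 28, 42) the latency with expression balancing.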
However, as mentioned in Section 2.5 of Chapter 2, VHLS’s EB has two shortcomings.
Firstly, it is not aware of the inter-iteration data-dependence and misses the opportunity
to further pipeline this loop. Secondly, and most importantly, VHLS does not guarantee
that this optimization will not result in catastrophic numerical inaccuracies.
We further discover that if the loop is partially unrolled, VHLS’s EB does not improve the
total run time, despite using a lot more resources. Additionally, EB only makes use of
associativity, but not other equivalence rules. These limitations pose great restrictions on
VHLS’s ability to produce a significantly faster implementation.
We then use the enhanced SOAP of this chapter to automatically discover equivalent
programs from the program in Figure 5.2. Because SOAP explores a large number of paths
that lead to a Pareto frontier of implementations, here we illustrate one of the many paths
that could be taken by minimizing latency, while trying to optimize accuracy and resource
usage. By using just arithmetic equivalences, SOAP specifically applies transformations
to alleviate the constraints on the inter-iteration dependence, and discovers that the
innermost loop can be rewritten to minimize latency as shown in Figure 5.3.
for (int j = 1; j < 1023; j++)
    A[i][j] = 0.2 * (A[i][j-1] +
        ((A[i][j] + A[i][j+1]) +
         (A[i+1][j] + A[i-1][j])));
Figure 5.3. The optimized program using only arithmetic equivalences.
Although this loop still has a data-dependence between consecutive iterations, this
transformation greatly reduces latency because most of the loop iterations can now be
overlapped. We find that this simple transformation can reduce II to 19, which speeds up
the original program by 2.3×, using almost the same number of LUTs and DSP elements
as the original program. At the same time, the sequence of additions is now reordered
to minimize round-off errors, improving the accuracy by 18%.
SOAP also supports more complex control-flow restructuring transformations, such as
partial loop unrolling, in tandem with rules that optimize memory accesses and arithmetic
calculations. This can further reduce the loop’s latency. In this example, unrolling the loop
by a factor of two (i.e. updating two matrix elements on every iteration and halving the
trip count) and applying other rules, results in a program with II = 19, D = 152, N = 511.
When implemented on a device it is 4.8× faster than the original, and almost twice as
accurate, at a cost of 17% more LUTs, as shown in Figure 5.4.
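The unrolled rewrite of Figure 5.4 can be sanity-checked against the original statement of Figure 5.2 on a small grid. The sketch below is only an illustration: the grid size `SN`, the input pattern, the use of the single-precision constant `0.2f` in the original statement, and all function names are assumptions, and the comparison is made up to round-off because the two versions associate their additions differently.

```c
#define SN 10  /* small grid for the check; Figure 5.2 uses 1024 */

/* One sweep over row i with the statement of Figure 5.2. */
static void seidel_row_original(float A[SN][SN], int i)
{
    for (int j = 1; j < SN - 1; j++)
        A[i][j] = 0.2f * (A[i-1][j] + A[i][j-1] +
                          A[i][j] + A[i][j+1] + A[i+1][j]);
}

/* The same sweep unrolled by two, following Figure 5.4. */
static void seidel_row_unrolled(float A[SN][SN], int i)
{
    for (int j = 1; j < SN - 1; j += 2) {
        float t0 = A[i][j-1], t1 = A[i][j+1];
        float t2 = (A[i][j] + t1) + (A[i+1][j] + A[i-1][j]);
        float t3 = 0.04f * t2 + 0.2f *
            ((t1 + A[i][j+2]) + (A[i+1][j+1] + A[i-1][j+1]));
        A[i][j]   = 0.2f * (t0 + t2);
        A[i][j+1] = 0.04f * t0 + t3;
    }
}

/* Runs both versions on identical inputs in [0, 1) and returns the
 * largest absolute difference between corresponding elements. */
float seidel_max_difference(void)
{
    float A[SN][SN], B[SN][SN];
    for (int i = 0; i < SN; i++)
        for (int j = 0; j < SN; j++)
            A[i][j] = B[i][j] = (float)((7 * i + 3 * j) % 10) / 10.0f;
    for (int i = 1; i < SN - 1; i++) {
        seidel_row_original(A, i);
        seidel_row_unrolled(B, i);
    }
    float worst = 0.0f;
    for (int i = 0; i < SN; i++)
        for (int j = 0; j < SN; j++) {
            float d = A[i][j] - B[i][j];
            if (d < 0.0f) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

The returned difference is expected to be at the level of single-precision round-off, which illustrates that the rewrite is an equivalence in real arithmetic rather than a bit-exact one.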
for (int j = 1; j < 1023; j += 2) {
    float t0 = A[i][j-1], t1 = A[i][j+1];
    float t2 = (A[i][j] + t1) + (A[i+1][j] + A[i-1][j]);
    float t3 = 0.04f * t2 + 0.2f *
        ((t1 + A[i][j+2]) + (A[i+1][j+1] + A[i-1][j+1]));
    A[i][j] = 0.2f * (t0 + t2);
    A[i][j+1] = 0.04f * t0 + t3;
}
Figure 5.4. The optimized program using arithmetic equivalences in tandem with control-flow
restructuring and memory access optimization.
Further increasing the optimization effort, which enables the loop to be more deeply
partially unrolled, leads to a program that is 7× faster than the original, but uses 2.8×
as many LUTs. To summarize, in Table 5.1, we compare VHLS with EB, against one of the many
implementations that we have explored using SOAP with the increased optimization
effort. The three columns respectively show the original program with loop pipelining
enabled, what VHLS can achieve alone, and the capability of SOAP. It is important to
note that the round-off error is unknown for VHLS with EB, because it cannot predict the
impact of its unsafe optimizations on accuracy. We performed place-and-route for exact
statistics.
Table 5.1. Comparison among the optimized implementations generated by VHLS’s expression
balancing and our optimizer. The row “Total run time (s)” indicates the wall-clock
time in seconds of running the synthesized circuits.
                          VHLS       VHLS with EB   VHLS with SOAP
Clock period (ns)         2.60       2.65           2.66
Inner latency (cycles)    47.0 k     28.6 k         6.59 k
Total run time (s)        2.50 m     1.56 m         0.358 m
LUTs                      620        623            1778
DSP elements              5          5              8
Round-off error           10.68 µ    unknown        4.31 µ
5.2 Syntax Definition
Because we incorporate multi-dimensional arrays in our program optimization, this
section further extends the syntax definition in Section 4.1 of Chapter 4 to support array
data types. This can be achieved by modifying the definition of x in (4.1):
\[ x ::= v\,([a])^{*}. \tag{5.1} \]
Here, ([a])* indicates that the term [a] can be repeated zero or more times,
v is a variable, and a is an arithmetic expression. The number of repetitions (N)
indicates that the variable v refers to an N-dimensional array, and the absence of repetitions (N = 0)
denotes a scalar. When x is used to declare an array, a must be a positive integer to
conform to the C syntax. To reflect this restriction and all above changes, the new syntax
definition for numerical programs is therefore:
\[
\begin{aligned}
a &::= n \mid v\,([a])^{*} \mid \texttt{-}a_1 \mid a_1 \otimes a_2, \\
b &::= \texttt{!}b_1 \mid a_1 \oplus a_2 \mid b_1 \odot b_2, \\
s &::= [t]\; v\,([c])^{*}\; [\texttt{=}\; a]\,\texttt{;} \mid s_1\; s_2 \mid \texttt{if (}b\texttt{) \{}s_1\texttt{\}}\; [\texttt{else \{}s_2\texttt{\}}] \mid \texttt{while (}b\texttt{) \{}s\texttt{\}}, \\
v &::= \texttt{x} \mid \texttt{y} \mid \texttt{z} \mid \ldots, \\
t &::= \texttt{int} \mid \texttt{float}.
\end{aligned}
\tag{5.2}
\]
Here, c ∈ ℕ ranges over the positive integers. Recall that a, a1, a2 ∈ AExpr, b, b1, b2 ∈
BExpr, s, s1, s2 ∈ Stmt, and v ∈ Var are respectively arithmetic expressions, Boolean
expressions, program statements, and variables of type t, which is either int or float.
5.3 Extending MIRs with Arrays
In Section 4.2 of Chapter 4 we discussed a new IR for program optimization. However,
it did not include support for arrays in the original description of the MIR format. All
examples that motivate us in extending SOAP include arrays, so in this chapter, we extend
MIRs to be able to represent programs that not only use scalar values, but also single- or
multi-dimensional arrays.
In many imperative languages such as C, arrays are stateful objects, i.e. they have side-
effects and are used to store information, and changes to them are reflected to concurrent
parts of the program that may be oblivious to the changes. This characteristic is known
as the lack of referential transparency [SS90]. Such behavior is not present in arithmetic
expressions, many functional programming languages, SSA forms, as well as MIRs. This
proves to be a challenge to us, because our efficient program optimization requires referen-
tial transparency to reliably and recursively divide the program into smaller subprograms
that can be optimized independently, without affecting other subprograms.
To remedy this, we treat arrays as immutable. We use a function update(A, x̄, e) to return
a new array that is the same as A but with (multi-dimensional) index x̄ now containing
e. Similarly, the function access(A, x̄) returns the element of A at index x̄. As a simple









The implication of making arrays immutable is two-fold. Firstly, we disallow pointer
aliasing, i.e. float *b = a; is not allowed in the C code, to keep the translation simple.
However, this is not a problem for us because the programs that can benefit from our
optimizations usually do not manipulate pointers. This issue can also be addressed in the
future by performing pointer analysis. Secondly, diverged paths in array updates could
occur if we naïvely optimize MIRs. For instance, if A is an input array, consider the two
expressions in a MIR, update(A, x̄, e) and update(A, x̄, e′), where e, e′ are equivalent.
They respectively update the x̄-th element of A with e and e′ and return different arrays.
A C program cannot be generated from this MIR without duplicating A. We solve
this problem by partitioning the MIR at “update” nodes using the method described in
Section 5.4.
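To make the immutable view of arrays concrete, the following is a minimal C sketch of the two functions for a one-dimensional array of fixed length. It is not how SOAP represents MIR nodes; the type `array_t`, the length `LEN`, and the names `access_elem` and `update_elem` are assumptions introduced purely for illustration. Passing the struct by value copies its contents, which models the fact that an update yields a new array and leaves the original untouched.

```c
#define LEN 8  /* illustrative fixed length */

/* An array value; passing or returning the struct copies its contents. */
typedef struct { float data[LEN]; } array_t;

/* access(A, i): the element of A at index i. */
float access_elem(array_t a, int i)
{
    return a.data[i];
}

/* update(A, i, e): a new array equal to A except that index i holds e.
 * The parameter is a private copy, so the caller's array is unchanged. */
array_t update_elem(array_t a, int i, float e)
{
    a.data[i] = e;
    return a;
}
```

Updating the same input array twice with different values produces two distinct arrays that both coexist with the original, which is the diverged-path situation described above: emitting C code for it would require duplicating the array.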
To give an example, consider the program in Figure 5.5a, which computes the Fibonacci
sequence; it can be represented by the MIR shown in Figure 5.5b.
for (int i=2; i<1023; i++)
{

















Figure 5.5. A simple example program and its corresponding MIR.
5.4 Structural Optimization
From a numerical program, we can generate a MIR using the translation process in
Section 5.3. The next step is to transform the MIR, and discover MIRs that are equivalent
to the original MIR in real arithmetic, but may execute differently in finite-precision
arithmetic because of round-off errors.
5.4.1 Improved Algorithm
As discussed in previous chapters, even a small expression could have a huge number of
equivalent ones. Exhaustively discovering all equivalent MIRs would result in a combinatorial
explosion of the number of equivalent MIRs in the search space. For this reason, we
build on the algorithm from Section 4.5 of Chapter 4, which searches efficiently by
discovering equivalences in a bottom-up hierarchy. In this section, we discuss two major
improvements to the algorithm which further increase its performance.
Partitioning
Instead of optimizing the MIR immediately, we start by partitioning the MIR into multiple
smaller sub-MIRs. The partition boundaries are determined by update operators. For
instance, we consider the partially unrolled Fibonacci example in Figure 5.6a. The MIR
of the loop body is shown in Figure 5.6b. The partition boundaries are indicated by the
regions surrounded by red dotted curves. A multiply shared subexpression, such
as i − 1, also determines the partition boundary: its partition is merged with that of the
parent whose partition has the smallest node count. If all parents' partitions contain the
same number of nodes, the choice is made randomly.
(b) The MIR of the partially unrolled loop body.
Figure 5.6. An example to illustrate how MIRs are partitioned.
Because transformation rules can only be applied to each partition but not across them,
the size of the search space can be reduced further. Each partition is in turn optimized separately,
generating a set of partitions equivalent to the original. We then select combinations
from these partitions to be merged, which generates a set of MIRs that are equivalent to the
original. Finally, we preserve those merged MIRs that lie on the Pareto frontier.
Optimization
Previously in SOAP, as we optimize parts of MIRs, the Pareto frontier is used to filter
discovered equivalent candidates (MIRs or semantic expressions), which keeps the size
of the set relatively small and manageable. As we optimize larger programs, the run
time of the tool increases significantly. Currently, the SOAP framework prunes the MIRs
that are Pareto-suboptimal, leaving only those that are on the Pareto frontier. However,
because our Pareto frontier is three-dimensional, there is a large increase in the number
of Pareto-optimal MIRs. This Pareto pruning approach is no longer feasible for our
benchmark examples. Therefore, in this chapter, not only do we use the Pareto frontier to
filter candidates, we also introduce a PRUNE function to further reduce the size of the Pareto
frontier.
We rely on the PRUNE function to efficiently steer the direction of our Pareto frontier as
we discover new candidates. It takes as input the set of Pareto-optimal equivalent
candidates that we have discovered, and prunes elements in this set to reduce its size
by sampling, keeping the number of discovered MIRs tractable. The pruning algorithm
is inspired by the Poisson-disk sampling algorithm [Bri07]. Our algorithm in Figure 5.7
starts by randomly selecting one point from the Pareto frontier E₀ (denoted by
RANDOMSAMPLE(E₀)). It then grows the set of points E by adding points that are
separated from those already selected by at least a certain distance δ, where the distance δ is decreased
iteratively by a factor ζ = 0.8, until E contains at least η = 20% of all points in the Pareto
frontier, or a maximum number of attempts (attempt_count) is reached. The distance













where f ∈ F enumerates each function that evaluates the performance of a candidate e,
i.e. the function f ∈ F computes either the accuracy, area or latency of e.
This method is superior to random sampling, because random sampling often samples
points that are close together, which usually are very similar implementations.
We found that with all improvements above and a faster accuracy analysis in Section 5.5.3,
the algorithm is significantly faster than the original optimization algorithm in Sec-
tion 4.5.2 of Chapter 4. Even though this algorithm may discover potentially fewer
candidates on the Pareto frontier, we can now explore greater partial loop unrolling
depths to widen the swing of the Pareto frontier in the same amount of time.
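The pruning step described above can be sketched in C as follows. This is an illustrative approximation rather than SOAP's implementation: the distance used here is an assumed Euclidean distance over the three metrics, each normalized by its range over the frontier; the seed is taken to be the first candidate instead of a random one so that the sketch is deterministic; and the names `candidate_t` and `prune_frontier` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_METRICS 3  /* accuracy, area, latency */

typedef struct { double m[NUM_METRICS]; } candidate_t;

/* Squared distance between two candidates, each metric scaled by its
 * range over the frontier (an assumed normalization). */
static double dist2(const candidate_t *x, const candidate_t *y,
                    const double range[NUM_METRICS])
{
    double s = 0.0;
    for (int k = 0; k < NUM_METRICS; k++) {
        double d = range[k] > 0.0 ? (x->m[k] - y->m[k]) / range[k] : 0.0;
        s += d * d;
    }
    return s;
}

/* Sets kept[i] for the sampled subset of front[0..n) and returns its size.
 * zeta shrinks the separation delta; eta is the target fraction to keep. */
size_t prune_frontier(const candidate_t *front, size_t n, bool *kept,
                      double zeta, double eta, int attempt_count)
{
    if (n == 0)
        return 0;
    double range[NUM_METRICS];
    for (int k = 0; k < NUM_METRICS; k++) {
        double lo = front[0].m[k], hi = front[0].m[k];
        for (size_t i = 1; i < n; i++) {
            if (front[i].m[k] < lo) lo = front[i].m[k];
            if (front[i].m[k] > hi) hi = front[i].m[k];
        }
        range[k] = hi - lo;
    }
    for (size_t i = 0; i < n; i++)
        kept[i] = false;
    kept[0] = true;  /* the text selects this seed at random */
    size_t count = 1;
    double delta = 1.0;
    for (int attempt = 0; attempt < attempt_count; attempt++) {
        for (size_t i = 0; i < n; i++) {
            if (kept[i])
                continue;
            bool in_range = false;
            for (size_t j = 0; j < n && !in_range; j++)
                if (kept[j] && dist2(&front[i], &front[j], range) < delta * delta)
                    in_range = true;
            if (!in_range) {  /* far enough from every kept point */
                kept[i] = true;
                count++;
            }
        }
        if ((double)count >= eta * (double)n)
            break;
        delta *= zeta;
    }
    return count;
}
```

Points that lie close to an already selected candidate are skipped until the separation has shrunk enough, which is what favours a spread-out sample over clusters of near-identical implementations.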
function SAMPLE(E₀)
    δ = 1.0
    E = {RANDOMSAMPLE(E₀)}
    for i = 1, 2, . . . , attempt_count do
        for e ∈ E₀ do
            in_range = ff
            for e′ ∈ E do
















Figure 5.7. The algorithm used to sample the Pareto frontier.
5.4.2 Transformation Rules
This section details the new transformation rules in the equivalence relation ≡ and
consequently in the structural optimization relation ⇝. Each transformation rule on its
own is not revolutionary, but for the first time, they are used in tandem with arithmetic
rules and control-flow restructuring rules introduced respectively in Chapters 3 and 4.
This enables a much better automatic structural optimization on the latency, resource
usage and accuracy of numerical programs, than is possible using only a subset of them.
In Chapters 3 and 4, SOAP provides a range of equivalence rules that are used in the
optimization, such as associativity, distributivity, commutativity, constant propagation,
and partial loop unrolling. In Table 5.2, we list those rules that proved effective when
minimizing loop latencies. Although these rules are used to transform MIRs, we present
before-and-after examples written in C to allow the effect of each rule to be readily
understood.
Table 5.2. Before-and-after examples to demonstrate the access reduction rules.
Access Reduction Rules
Rule                             Before                  After
Multiple reads                   x=A[i--]; y=A[i+1];     x=A[i--]; y=x;
Multiple writes                  A[i++]=x; A[i-1]=y;     A[i++]=y;
Read after write                 A[i++]=x; y=A[i-1];     A[i++]=x; y=x;
Indep. accesses (where i ≢ j)    A[i]=x; y=A[j];         y=A[j]; A[i]=x;
Our new rules, the access reduction rules, with formal definitions below and examples in
Table 5.2, remove extraneous data-dependences that arise after partial unrolling. On their
own, these rules and partial loop unrolling have little impact on latency, because they
are very well studied in polyhedral loop dependence analysis, and tools such as VHLS can
make use of them automatically. However, they give the necessary freedom to arithmetic
rules to affect latency. The rules are as follows, where A is an array, ī, j̄ are subscripts,
and e, e′ are expressions:
• Multiple reads, eliminates the second of two reads of the same location. This arises
naturally from the MIR, as common subexpressions are shared.
• Multiple writes, eliminates a write that is overwritten:
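\[ \mathrm{update}(\mathrm{update}(A, \bar{\imath}, e), \bar{\imath}, e') \rightsquigarrow \mathrm{update}(A, \bar{\imath}, e'). \tag{5.5} \]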
• Read after write, eliminates a read from a location that has just been written:
\[ \mathrm{access}(\mathrm{update}(A, \bar{\imath}, e), \bar{\imath}) \rightsquigarrow e. \tag{5.6} \]
• Independent accesses, allows two array operations to be reordered if it can be proved
that they never access the same location:
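\[ \mathrm{access}(\mathrm{update}(A, \bar{\imath}, e), \bar{\jmath}) \rightsquigarrow \mathrm{access}(A, \bar{\jmath}), \quad \text{if } \bar{\imath} \not\equiv \bar{\jmath}. \tag{5.7} \]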
We also visualize this rule in Figure 5.8, which shows a sample MIR transformation.
Figure 5.8. A sample MIR transformation using the independent accesses rule.
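The effect of the access reduction rules can also be illustrated on ordinary C arrays. The pairs of functions below are a sketch written for this purpose, with hypothetical names; each pair performs the same observable computation before and after one rule is applied. The independent accesses pair is only equivalent under the stated side condition that the two indices differ.

```c
/* Multiple writes: the first write is overwritten, so it can be dropped. */
void multiple_writes_before(float *A, int i, float x, float y)
{
    A[i] = x;
    A[i] = y;
}

void multiple_writes_after(float *A, int i, float x, float y)
{
    (void)x;  /* the value x is never observable */
    A[i] = y;
}

/* Read after write: the value read back is the value just written. */
float read_after_write_before(float *A, int i, float x)
{
    A[i] = x;
    return A[i];
}

float read_after_write_after(float *A, int i, float x)
{
    A[i] = x;
    return x;
}

/* Independent accesses: the read may move above the write when i != j. */
float independent_before(float *A, int i, int j, float x)
{
    A[i] = x;
    return A[j];
}

float independent_after(float *A, int i, int j, float x)
{
    float y = A[j];
    A[i] = x;
    return y;
}
```

In each "after" version one memory access has been removed or decoupled from the write, which is what shortens the dependence chain through the array.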
These rules may not seem powerful on their own, but when combined with other structural
rules, they enable SOAP to detect dependences that can be removed in the MIR. This in
turn allows more opportunities for the rules to further reduce loop latency. By way of
illustration, we optimize the Fibonacci series example program in Figure 5.5a for latency.
By partially unrolling the loop by a factor of 2, we obtain the program in Figure 5.6a. We
can see that because of the rigid array access pattern, associativity cannot be applied
easily to the loop kernel. However, by applying the above access reduction rules first, we
give associativity the freedom to reduce latency by half and improve accuracy by 50%, as
shown in Figure 5.9.
for (int i = 2; i < 1023; i += 2) {
    float t2 = A[i - 2], t3 = A[i - 3];
    A[i - 1] = t2 + t3;
    A[i] = 2 * t2 + t3;
}
Figure 5.9. The optimized program that computes the Fibonacci sequence. It reduces latency of
the original in Figure 5.5a by half and improves accuracy by 50%.
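The following sketch shows the same idea as a pair of C functions that can be compared directly. It is an illustration under stated assumptions, not the code generated by SOAP: the original loop body is assumed to be the usual recurrence `A[i] = A[i-1] + A[i-2]`, the indexing of the unrolled version is shifted relative to Figure 5.9 so that every access stays within bounds, and `n - 2` is assumed to be even so that no epilogue iteration is needed.

```c
/* Assumed original loop: one element per iteration. */
void fib_original(float *A, int n)
{
    for (int i = 2; i < n; i++)
        A[i] = A[i-1] + A[i-2];
}

/* Unrolled by two after access reduction and arithmetic rewriting:
 * both elements are computed from values loaded once at the start of
 * the iteration (assumes (n - 2) is even). */
void fib_unrolled(float *A, int n)
{
    for (int i = 2; i + 1 < n; i += 2) {
        float t1 = A[i-1], t2 = A[i-2];
        A[i]     = t1 + t2;
        A[i + 1] = 2.0f * t1 + t2;  /* (t1 + t2) + t1 rewritten */
    }
}
```

For small inputs the Fibonacci numbers are exactly representable in single precision, so both functions fill the array with identical values.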
Without the above access reduction rules, it is therefore not possible to reach this opti-
mized implementation. Conversely, it is not possible to relax scheduling constraints due
to inter-iteration dependences without arithmetic equivalence rules, as these reduction
rules are there to assist transformation rules that make a difference in latency. Therefore
the rules in Table 5.2 must be used in conjunction with arithmetic and control-flow rules
to optimize latency in numerical programs.
5.5 Performance Analysis
This section explains how we analyze MIRs for our three performance metrics: latency,
resource usage, and accuracy.
5.5.1 Latency Analysis
The purpose of our latency analysis is not to create a complete scheduling of numerical
programs, as this would be computationally expensive, and would need to be repeated
for tens of thousands of equivalent programs. Instead, it computes a lower bound of
the loop’s II, the minimum initiation interval (MII). (Recall that the initiation interval
is the number of clock cycles that must elapse between the starts of two consecutive
loop iterations, and is determined by data dependences and resource constraints.) We
then compute the overall latency of the loop, and subsequently, the total latency of the
program.
Following LegUp [LU], we compute MII values using the first few steps of modulo SDC
scheduling [Can+14] introduced in depth in Section 2.2.3 of Chapter 2, by viewing
MIRs as dependence graphs, as the structure of a MIR already captures intra-iteration
data-dependences. In addition to this, we add extra latency information as attributes
on the edges of MIRs, and new edges to form cycles that capture inter-iteration data-
dependences. The analysis is carried out in three stages.
The analysis starts with the MIR of the loop under analysis. Each edge in the MIR, say
s → t, represents a data-dependence: the operation at node s must be evaluated fully
before the operation at t can begin. The first step is to add a pair 〈l, d〉 for each edge of
the MIR. Here, l is the latency of the edge (the number of clock cycles that must elapse
between the start of s and the start of t) and d is the dependence distance (the number
of loop iterations that must elapse between the start of s and the start of t). Because
all operations in the MIR are performed in a single iteration, all edges have d = 0. The
value of l is given by the latency of the operation at node s; if s corresponds to an input
variable or a numerical constant, then l = 0.
The second stage is to add edges to form a cyclic dependence graph that captures read
after write (RAW) dependences across loop iterations. This step involves checking whether
each pair of “access” and “update” nodes has a dependence, and if so, adding a new
edge between them with latency and dependence distance attributes. As an example,
consider the MIR in (5.3) and assume each iteration increments i by 1. Because in the
original program, A[i] and A[i+1] are respectively reading from and writing to the
same array A, we need to check if these accesses could touch the same memory location in
different iterations. For this, our analysis formulates an ILP problem for the dependence
distance, and solves it using the integer set library (isl) [Ver10]. In this example, the
dependence distance is 1 because the value written to A[i+1] in the current iteration i
is immediately used in the next iteration i+1. Similarly, we also add new edges for reads
and writes to the same variable, which can be treated as a special array with only one
Figure 5.10. The MIR with edges labelled with latency attributes.
Note the new dashed edge from the update node to the access node, which is labeled
〈−2, 1〉. The first value, −2, signifies that the latency of the edge between × and access,
which is 2 cycles, is canceled out. This is because the multiplier can reuse its output
from the previous iteration as the input for the current iteration without requiring a
2-cycle delay to read from A for the value that was updated in the previous iteration.
The second value, 1, indicates that there is a data flow dependence from iteration i to
iteration i+ 1.
We assume no limit on the number of operators we can allocate, so operators do not
constrain II. However, in VHLS, each array is usually translated into a dual-port RAM,
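which allows only two accesses per clock cycle [Xil12], and thus constrains MII. The
resource-constrained minimum initiation interval (ResMII) is therefore
\[ \mathrm{ResMII} = \max_{a \in A} \left\lceil \frac{n_a}{r_a} \right\rceil, \tag{5.8} \]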
where a ∈ A ranges over all arrays in the loop body, na is the number of accesses to the
array a, i.e. the number of shared access and update nodes accessing the array a in the
dependence graph, and ra = 2 is the maximum number of accesses allowed per cycle per
array.
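Under these definitions, the memory-port bound amounts to taking, over all arrays, the largest number of cycles needed to issue that array's accesses. A minimal C sketch is given below; the function name is hypothetical, the same port count is assumed for every array, and the result is clamped to at least one cycle as an assumption of this sketch.

```c
/* Resource-constrained MII from array port limits: an array with n_a
 * accesses per iteration and r_a ports needs ceil(n_a / r_a) cycles. */
int res_mii(const int *num_accesses, int num_arrays, int ports_per_array)
{
    int mii = 1;  /* an initiation interval is at least one cycle */
    for (int a = 0; a < num_arrays; a++) {
        int need = (num_accesses[a] + ports_per_array - 1) / ports_per_array;
        if (need > mii)
            mii = need;
    }
    return mii;
}
```

For example, a loop body with six accesses to a single dual-port array cannot start a new iteration more often than every three cycles.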
The final step is to calculate the recurrence-constrained minimum initiation interval (RecMII):
\[ \mathrm{RecMII} = \max_{c \in C} \left\lceil \frac{l_c}{d_c} \right\rceil, \tag{5.9} \]
where c ∈ C ranges over all cycles in the graph, and lc and dc are respectively the sums
of all latencies and dependence distances of the edges in the cycle. Because a typical MIR
with array accesses could have a very large number of cycles, we efficiently search for an
MII using a modified Floyd–Warshall algorithm [Flo62], following [Rau94].
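One common way to avoid enumerating cycles, in the spirit of [Rau94], is to test a candidate II by giving every edge the weight l − II · d and checking with an all-pairs longest-path computation whether any cycle has positive total weight. The sketch below illustrates this idea on a small dependence graph; it is not SOAP's implementation, and the types, names, node limit, and linear search over II are assumptions made for brevity.

```c
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 16
#define NO_PATH (LONG_MIN / 4)

typedef struct { int src, dst, latency, distance; } edge_t;

/* True if some cycle has positive total weight when every edge weighs
 * latency - ii * distance, i.e. the recurrences cannot be met at this II. */
static bool has_positive_cycle(const edge_t *edges, int num_edges,
                               int num_nodes, int ii)
{
    long longest[MAX_NODES][MAX_NODES];
    for (int i = 0; i < num_nodes; i++)
        for (int j = 0; j < num_nodes; j++)
            longest[i][j] = NO_PATH;
    for (int e = 0; e < num_edges; e++) {
        long w = edges[e].latency - (long)ii * edges[e].distance;
        if (w > longest[edges[e].src][edges[e].dst])
            longest[edges[e].src][edges[e].dst] = w;
    }
    /* Floyd-Warshall in the (max, +) semiring. */
    for (int k = 0; k < num_nodes; k++)
        for (int i = 0; i < num_nodes; i++)
            for (int j = 0; j < num_nodes; j++) {
                if (longest[i][k] == NO_PATH || longest[k][j] == NO_PATH)
                    continue;
                long via = longest[i][k] + longest[k][j];
                if (via > longest[i][j])
                    longest[i][j] = via;
            }
    for (int i = 0; i < num_nodes; i++)
        if (longest[i][i] > 0)
            return true;
    return false;
}

/* Smallest II for which no cycle is violated, or -1 if none exists
 * (for instance, a cycle whose total dependence distance is zero). */
int rec_mii(const edge_t *edges, int num_edges, int num_nodes)
{
    int upper = 1;
    for (int e = 0; e < num_edges; e++)
        if (edges[e].latency > 0)
            upper += edges[e].latency;
    for (int ii = 1; ii <= upper; ii++)
        if (!has_positive_cycle(edges, num_edges, num_nodes, ii))
            return ii;
    return -1;
}
```

For a single cycle with total latency 19 and dependence distance 1 this returns 19, and with distance 2 it returns 10, matching the ceiling of latency over distance.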
Finally, we estimate the total latency L of the loop with:
\[ L = (N - 1)\,\mathrm{MII} + D, \quad \text{where } \mathrm{MII} = \max(\mathrm{RecMII}, \mathrm{ResMII}). \]
Recalling from Section 2.2.3 in Chapter 2, N is the maximum trip count, i.e. the loop’s
total number of iterations, and D is the loop’s depth, i.e. the total number of cycles per
iteration.
Because we optimize MIRs in a bottom-up hierarchy, when an expression that does not
constitute an inter-iteration dependence is optimized, its latency is estimated by
scheduling its operations using an as-late-as-possible (ALAP) [Wan+08] scheduling
algorithm, where each operation is scheduled at the latest opportunity, while respecting
the order of data dependences. This fast scheduling algorithm is also used to estimate
the depth D of a pipelined loop.
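A compact sketch of ALAP scheduling on a dependence DAG is shown below. It is a generic textbook formulation rather than SOAP's code: operations are assumed to be numbered in topological order (every dependence goes from a lower to a higher index), resources are assumed unlimited, and the type and function names are hypothetical.

```c
#define MAX_OPS 32

typedef struct { int src, dst; } dep_t;  /* dst consumes the result of src */

/* Fills alap_start[] with the latest start cycle of each operation and
 * returns the depth (total cycles), or -1 if there are too many operations. */
int alap_schedule(const int *latency, int num_ops,
                  const dep_t *deps, int num_deps, int *alap_start)
{
    int asap_start[MAX_OPS] = {0};
    int depth = 0;
    if (num_ops > MAX_OPS)
        return -1;

    /* Forward pass: earliest start times determine the overall depth. */
    for (int v = 0; v < num_ops; v++) {
        for (int d = 0; d < num_deps; d++)
            if (deps[d].dst == v) {
                int ready = asap_start[deps[d].src] + latency[deps[d].src];
                if (ready > asap_start[v])
                    asap_start[v] = ready;
            }
        if (asap_start[v] + latency[v] > depth)
            depth = asap_start[v] + latency[v];
    }

    /* Backward pass: start each operation as late as its consumers allow. */
    for (int v = num_ops - 1; v >= 0; v--) {
        int finish_by = depth;
        for (int d = 0; d < num_deps; d++)
            if (deps[d].src == v && alap_start[deps[d].dst] < finish_by)
                finish_by = alap_start[deps[d].dst];
        alap_start[v] = finish_by - latency[v];
    }
    return depth;
}
```

As a usage example with made-up latencies, for the expression (x * y) + z with 2-cycle loads, a 7-cycle multiplier and a 10-cycle adder, the depth is 19 cycles and the load of z is deferred to cycle 7, since its value is not needed until the addition starts at cycle 9.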
Because the expression is eventually used in a loop, and the II of the loop is critical
to how fast the loop can execute, it is necessary to start optimizing for II as soon as
possible. Therefore, in our latency analysis of a MIR or an expression that is a fragment
of an inter-iteration dependence cycle, our algorithm automatically shortens any paths
between any pairs of dependent accesses in the MIR, as we use the latency analysis
as a component to manoeuvre our optimization on the Pareto frontier. Moreover, we
place greater weight on dependent accesses with smaller dependence distances, because
these impact the resulting loop II more significantly than larger distances. For instance,
consider a loop body that has two dependent accesses across iterations, i.e. the graph
contains two cycles:
A[i] = f (A[i-1], A[i-2]);
Here as we optimize this program in a bottom-up hierarchy, f is optimized before the
loop body. For MII considerations, the subexpression tree f is thus rewritten such that










Here, l1 and l2 are respectively the lengths of the longest latency-weighted paths from
the nodes access(A, i − 1) and access(A, i − 2) to the root of f, and d1 and d2 are
respectively the dependence distances of the two nodes to the write of A, where d1 = 1
and d2 = 2.
5.5.2 Resource Utilization Analysis
The hardware resource usage analysis of Chapter 3 captures the sharing of common
subexpressions, but cannot analyze resource binding, which allows common operations
to be shared across clock cycles. For instance, in the floating-point expression a + (b + c),
the two additions can be computed using one addition operator only. In this section, we
develop a new resource usage analysis that fully understands how resources are shared
temporally in an FPGA implementation of numerical programs.
We rely on the foundation of resource usage analysis from Chapters 3 and 4, which counts
the number n⊗ of each type of operation ⊗ ∈ Op, while maximally sharing common
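subexpressions. In a pipelined loop, we compute a lower bound a⊗ on the number of
allocated instances of each operator ⊗:
\[
a_\otimes =
\begin{cases}
\left\lceil n_\otimes / \mathrm{MII} \right\rceil & \text{if } \otimes \text{ is shared,} \\
n_\otimes & \text{otherwise.}
\end{cases}
\tag{5.11}
\]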
Here, integer operators are typically not shared [Li+15], so the number of operations is
the number of allocated instances.
For instance, if we know that a pipelined loop has MII = 3, and each iteration uses 6
multiplications, then we can compute that we need to synthesize at least 2 multipliers.
For straight-line code, non-pipelined loops, and different loops, we use a simple ALAP
scheduling [Wan+08] to estimate resource utilization.
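The multiplier example above corresponds to a simple ceiling division for shared operators. The helper below is a small sketch of that calculation for a pipelined loop; the function name and the Boolean flag are chosen here for illustration.

```c
#include <stdbool.h>

/* Lower bound on the operator instances needed in a pipelined loop:
 * a shared operator can serve up to `mii` operations per iteration,
 * whereas an unshared one needs an instance per operation. */
int min_operators(int num_ops, int mii, bool shared)
{
    if (!shared)
        return num_ops;
    return (num_ops + mii - 1) / mii;
}
```

With six multiplications and MII = 3, the shared case gives two multipliers, whereas an unshared operator type would need six instances.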
Finally, we accumulate the number of LUTs and DSP elements for all allocated operators.
In addition, we estimate the number of LUTs required by multiplexers generated for
sharing operators, where \(R^{\mathrm{LUTs}}_{\mathrm{mux}}\) approximates 1/n of the number of LUTs required by














where \(R^{\mathrm{LUTs}}_{\otimes}\) and \(R^{\mathrm{DSPs}}_{\otimes}\) denote the number of LUTs and DSPs required by one operator
⊗ respectively.
5.5.3 Accuracy Analysis
We extend the accuracy analysis of Chapter 4 to support arrays. Because our benchmark
suite consists of programs with large arrays, we keep the analysis efficient by treating an
entire array as a pair of a floating-point interval and an interval of accumulated round-off
errors. These intervals, representing the worst case bounds of all elements in the array,
accumulate all values that are assigned to the array, and never shrink the range bounded
by these intervals when we assign new values to an array location. We therefore define
the read and write accesses to an array as follows in the accuracy analysis:
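\[
\begin{aligned}
E_s[\mathrm{access}(A, \bar{\imath})]\,\sigma^\sharp &= E_s[A]\,\sigma^\sharp, \\
E_s[\mathrm{update}(A, \bar{\imath}, e)]\,\sigma^\sharp &= E_s[A]\,\sigma^\sharp \sqcup E_s[e]\,\sigma^\sharp,
\end{aligned}
\tag{5.13}
\]
where $A \in \mathrm{Var}$ is an array variable, $\bar{\imath}$ is a subscript to index an element in $A$, $e \in \mathrm{SemExpr}$ is an expression, and $\sigma^\sharp \in \Sigma_E^\sharp$ is the input abstract program state (recall from
Section 4.3 of Chapter 4).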
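A minimal sketch of this array abstraction is given below, assuming each abstract value is a pair of closed intervals (one for values, one for accumulated round-off errors) and that the join is the interval hull. The type and function names are hypothetical, and a sound implementation would additionally round interval bounds outward, which is omitted here.

```c
#include <assert.h>

/* Closed interval [lo, hi]. */
typedef struct { double lo, hi; } interval;

/* Abstract value: range of values and range of round-off errors. */
typedef struct { interval value, error; } abs_value;

/* Smallest interval containing both a and b. */
static interval interval_join(interval a, interval b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    interval r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Component-wise join of two abstract values. */
abs_value abs_join(abs_value a, abs_value b)
{
    abs_value r = { interval_join(a.value, b.value),
                    interval_join(a.error, b.error) };
    return r;
}

/* Reading any element yields the summary of the whole array;
 * the subscript plays no role in this abstraction. */
abs_value array_access(abs_value array)
{
    return array;
}

/* Writing e to any element can only widen the summary, never shrink it. */
abs_value array_update(abs_value array, abs_value e)
{
    return abs_join(array, e);
}
```

The design trades precision for speed: one summary per array keeps the analysis cost independent of the array size, at the price of losing all per-element information.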
Alternatively, we can view each element A[i], where i is a tuple with N non-negative
integers, in an N -dimensional array A as a variable. Updating A thus produces an abstract
state which collects all elements in A:




Es [update (A, ı¯, e)]σ] =
[





where $i \in E_s[\bar{\imath}]\,\sigma^\sharp$ ranges over all possible indices bounded by $E_s[\bar{\imath}]\,\sigma^\sharp$ that are tuples of
$N$ non-negative integers.
Additionally, because most of the loops in our benchmark programs consist of nested
loops and have large iteration counts, the fixpoint analysis routine in Section 4.3.4 is
modified to analyze only a small fraction of the innermost loop execution of a loop
nest. By ensuring the innermost loop iterator increments by the same amount, we
can guarantee the accuracy analysis to be fair for each equivalent implementation for
fixpoint expressions with different unroll factors. For the experimental outcomes in
Section 5.6, we analyze 10% of the total executions of an innermost loop for the purpose
of optimization.
Finally, we further use the dependence analysis in the MIR graph explained in Section 5.5.1
to detect whether errors are accumulated across iterations. This process is
carried out by first analyzing the MIR of the loop body for inter-iteration dependences.
The absence of such dependences indicates that round-off errors do not accumulate across
iterations, and it suffices to analyze the fixpoint expression for one loop iteration.
5.6 Evaluation
We have evaluated SOAP on a suite of benchmark examples which consists of several
applications that have recurring inter-iteration dependences:
• A simple loop, sum, that sums the elements in an array.
• Two kernels from Livermore Loops [DL11]: dotprod, which computes the dot
product of two vectors, and tridiag, which solves a tridiagonal linear system of
equations.
• Nine kernels from PolyBench [Pou], which calculate matrix/vector transpositions,
additions and multiplications (2mm, 3mm, atax, gemm, gemver, mvt), the bi-
conjugate gradient stabilized method (bicg), the Seidel stencil computation
(seidel, modified to compute with 5 points), and symmetric rank-2k operations
(syr2k).
All elements of input arrays and matrices are set to be single-precision floating-point
values between 0 and 1. We optimized all of these benchmark examples using SOAP,
specifically targeting the Xilinx Virtex7 device running at 333 MHz, for the three objec-
tives of accuracy, resource utilization and latency simultaneously. We then used VHLS
2015.2 [Xil12] to synthesize the resulting optimized programs into RTL implementations
for exact latency information, and performed place-and-route using Vivado Design Suite
2015.2 [Xil15], to obtain exact resource utilization statistics. Finally, SOAP produces a
four-dimensional Pareto frontier for each program, where the dimensions are accuracy,
latency, the number of LUTs and DSP element count. For clarity, we visualize the results in
three dimensions only.
Table 5.3. Comparisons of the original program (rows labelled original) and the optimized
program with lowest latency (rows labelled optimized), for each benchmark. Values in
parentheses are obtained after slightly tweaking our experimental set-up; see
Section 5.6.2. We performed place-and-route for exact statistics.

Name     Version    DSPs  LUTs  LUT ratio  Error   Error ratio  Clock (ns)  Latency (cycles)  Latency (s)  Latency ratio
sum      original   2     303   0.257      914 µ   7.93         2.54        41.0 k            104 µ        12.8
         optimized  4     1181             1.15 µ               2.54        3.21 k            8.17 µ
dotprod  original   5     411   0.231      926 µ   7.29         2.54        41.0 k            104 µ        12.4
         optimized  10    1781             127 µ                2.62        3.23 k            8.44 µ
tridiag  original   5     470   0.288      63.1 µ  1.06         2.54        17.8 M            45.3 m       3.41
         optimized  8     1631             59.4 µ               2.69        4.93 M            13.3 m
2mm      original   5     781   0.385      209     3.40         2.79        20.4 G            57.0         7.46
         optimized  8     2029             61.4                 2.92        2.62 G            7.64
3mm      original   5     760   0.207      114     6.76         2.55        32.3 G            82.3         9.13
         optimized  10    3677             16.9                 2.82        3.19 G            9.01
atax     original   5     627   0.507      353 m   1.54         2.60        176 M             457 m        5.42
         optimized  5     1237             230 m                2.61        32.4 M            84.3 m
bicg     original   5     427   0.304      887 µ   6.72         2.54        160 M             407 m        8.98
         optimized  5     1406             132 µ                2.78        16.3 M            45.3 m
gemm     original   5     524   0.234      1.99    2.97         2.54        10.8 G            27.4         9.13
         optimized  10    2240             0.67                 2.69        1.12 G            3.00
seidel   original   5     620   0.349      10.7 µ  2.46         2.60        960 M             2.50         7.16
         optimized  8     1778             4.31 µ               2.66        131 M             0.349
gemver   original   5     809   0.382      7.28 M  4.46         2.87        23.1 M            66.2 m       3.15
         optimized  5     2120             1.63 M               2.77        7.60 M            2.10 m       (8.29)
mvt      original   5     701   0.251      91.0 µ  3.32         2.56        23.1 M            59.1 m       7.49
         optimized  10    2793             27.4 µ               2.80        2.82 M            7.89 m       (9.30)
syr2k    original   5     709   0.259      250 µ   4.07         2.89        14.0 G            40.3         6.95
         optimized  10    2740             61.4 µ               2.71        2.14 G            5.80         (7.62)
Geomean                         0.289              3.69                                                    7.19 (8.01)
5.6.1 Results
Table 5.3 compares, for each benchmark in our evaluation set, the performance metrics
of the original program against those of the program with the smallest latency discovered
by SOAP. We synthesized each program to a circuit to obtain exact statistics, which are
shown in Table 5.3.
Figure 5.11 compares our estimated LUT counts (vertical axis) against the exact LUT
counts (horizontal axis) obtained by synthesizing RTL implementations of each program


















Figure 5.11. Comparisons of our estimated LUT counts against actual LUT counts from VHLS.
in Table 5.3. Our estimates deviate from the exact values because we compute
lower bounds on resource utilization and do not take synthesized finite state machines
and address calculation into account. Nevertheless, our estimates still accurately predict
the general trend: a linear regression of all scatter points finds R² = 0.9344.
Figure 5.12 compares our latency estimates (vertical axis) against the actual latency
values (horizontal axis). The solid line represents the linear regression of data points that
we have gathered in Table 5.3. This line is a tight fit with our data, with R² = 0.9959,
which indicates that our latency estimation can accurately predict the exact latency of
synthesized implementations.
Returning to our motivating example from Section 5.1, Figure 5.13 demonstrates the
range of optimized programs discovered by SOAP when applied to the Seidel stencil
loop kernel. In the figure, ×-points indicate the original program. By using only the
rules of real arithmetic, SOAP finds a more efficient program that can improve run time
by 2.5×, as shown by the points marked with a red circle. However, by enabling partial loop unrolling and
our dependence elimination rules, the performance is further improved, resulting in a
















Figure 5.12. Comparisons of our estimated latency statistics against actual latency from VHLS.
6.7× reduction of total run time. Furthermore, we have found that numerical accuracy
can often be optimized at the same time as we optimize the initiation intervals of loops.
This is because partially unrolling a loop grows the expressions in the loop body, which
gives SOAP greater freedom to restructure expressions and discover
more accurate variants. In this example, the fastest program is also the most accurate
one: it reduces round-off errors by approximately 2.5×. It is worth noting that as SOAP
explores deeper levels of partial loop unrolling, the performance gains diminish because
the design hits a bottleneck in memory bandwidth. VHLS synthesizes dual-port RAMs for
arrays, so the memory allocated to an array can be read at most twice in one clock cycle.
Our optimization flow detects this bottleneck when the more deeply unrolled variants are
pruned from the Pareto frontier, and stops exploring further
loop unrolling.
Similar graphs for the other benchmarks can be viewed online at
https://admk.github.io/soap/plot.html, each showing three
projections from different axes of the 3D Pareto frontier. Our web page can be used to
interactively explore the positions of each data point on the three projections simultaneously,
and view the corresponding generated C programs.
5.6.2 Discussion
As demonstrated by Figure 5.12, SOAP generally produces accurate latency estimates.
However, we have discovered a few notable discrepancies. For instance, gemver, mvt
and syr2k all have significant differences between our estimated latency and the actual
latency from synthesized RTL implementations. An inspection of these programs reveals
that they all share a common programming idiom:
for (int i=0; i<N; i++)
    for (int j=0; j<N; j++)
        x[i] += ...;
We found that VHLS occasionally fails to find the optimal schedule, predicted by SOAP,
that could pipeline this loop as tightly as possible. Rewriting the above code into the
following:
for (int i=0; i<N; i++) {
    float sum = x[i];
    for (int j=0; j<N; j++)
        sum += ...;
    x[i] = sum;
}
fixes this problem, and enables VHLS to generate a hardware implementation with the
expected II. The ratios in parentheses in Table 5.3 reflect the speedup by performing this
simple fix.
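The rewrite can be made concrete with a compilable sketch. The accumulated expression `A[i][j] * y[j]` and the size `N = 4` are assumptions chosen for illustration only, since the benchmarks accumulate different expressions; both versions perform the same floating-point operations in the same order.

```c
#define N 4 /* hypothetical problem size */

/* Original idiom: accumulate directly into the array element, so every
 * inner iteration reads and writes x[i]. */
void accumulate_in_array(float x[N], float A[N][N], float y[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i] += A[i][j] * y[j];
}

/* Rewritten idiom: accumulate in a scalar and write x[i] back once per
 * outer iteration, which keeps the recurrence out of memory. */
void accumulate_in_scalar(float x[N], float A[N][N], float y[N])
{
    for (int i = 0; i < N; i++) {
        float sum = x[i];
        for (int j = 0; j < N; j++)
            sum += A[i][j] * y[j];
        x[i] = sum;
    }
}
```

Because the order of the additions is unchanged, this particular rewrite does not alter the round-off behaviour; it only changes where the running total is held.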
After this modification, VHLS generated circuits with the same II values as predicted by
SOAP for all of the benchmark examples, even though in practice the
MII may be unachievable for certain resource-constrained applications. Because
floating-point operations are often associated with long latencies, the MII of floating-point
numerical programs could require multiple cycles, reducing the constraint imposed by
resources on the actual II.
5.7 Summary
Minimizing the latency of loops is a central task for HLS tools that obtain FPGA implemen-
tations from numerical C programs. Loop latency can often be reduced by performing
simple rewrites to minimize inter-iteration data-dependences, but HLS tools cannot en-
able such rewrites by default because they may impact the accuracy of floating-point
computations. This chapter presents the first tool that is able to automatically rewrite a
given program to optimize latency, while controlling for accuracy and resource usage. Our
experimental results suggest that, in fact, latency and accuracy are often not in conflict:
that programs aggressively optimized for latency can also have minimal round-off errors,
albeit with greater resource usage. We have demonstrated that SOAP can optimize
commonly used code fragments from PolyBench [Pou] and Livermore Loops [DL11] to
have up to a 12× increase in performance, and up to 7× reduction of round-off errors, at
the cost of up to 4× more resource utilization.

















































Figure 5.13. Pareto-optimal variants of the Seidel stencil program from Figure 5.2. Each graph
shows a 2D projection of the 3D Pareto frontier. In each graph, the original program
is marked ×, and the lowest-latency variant obtained by arithmetic transformations
alone is marked by the red circle.
6 Conclusion
HLS tools are typically designed to adhere to a rigid specification which outlines their
behaviour. It is a traditional practice to design this specification and the subsequent tool to
ensure that the synthesized circuits perform functionally identically to the original source
program written in the HLL. This is also viewed as good practice because it has predictable
outcomes. Guided by the rules of the language, programmers translate mathematical
objects such as algorithms and physical information respectively into source code and
numerical data, in a way similar to tools adhering to their specifications. This manual
process of translation is unfortunately an approximate one. Computations as simple as
√3 must be approximated, e.g. they are carried out in floating-point arithmetic, because
of the finite nature of computing machines. Therefore, HLS tools cannot be relied upon
for an exact interpretation of the mathematical objects we wish to implement, even if
they guarantee the functional equivalence between the source code and the synthesized
result.
Despite the awareness of the approximate nature of numerical software/hardware imple-
mentations using floating-point operations, engineers often take the risk of neglecting this
fact, and expect their designs to behave identically to the mathematical algorithms
envisioned in real arithmetic within a reasonable but not well-defined error margin. As
shown in Section 2.5.2 of Chapter 2, accumulated round-off errors can
have detrimental effects on our daily life. The aforementioned functional equivalence
between source and circuit guaranteed by HLS tools is therefore unable to regain any lost
accuracies due to approximation.
Traditional IR-level HLS program optimization consists of a series of transformation passes.
Most of these passes do not predict whether they have negative impact on the resulting
circuit, and they limit their capabilities by preserving functional equivalence. Varying the
order of these passes could have a significant impact on the quality of the result. As these
passes interact with one another in a complicated manner, it is difficult to predict the
overall impact on performance [Hua+15]. For n passes, there are up to n! distinct orderings.
Deciding the optimal pass ordering is thus a considerable challenge, which is exacerbated by the
fact that it could be highly dependent on the input program [Con+13].
The above shortcomings of traditional HLS tools and optimizing compilers provide a
strong motivation for the work proposed in this thesis.
Firstly, we can relax the functional equivalence required by
HLS tools. At the same time, we infer equivalences of the underlying mathematical objects
in real arithmetic, which hardware designs are approximating. One can often improve the
numerical accuracy by choosing a better alternative among these equivalences.
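To illustrate why the choice among real-arithmetic equivalences matters, the sketch below evaluates two associations of the same sum in single precision. It assumes IEEE 754 binary32 arithmetic with round-to-nearest; the function names and the input values in the usage note are chosen for illustration only.

```c
/* (a + b) + c, with each intermediate result rounded to single precision.
 * The volatile qualifier forces the intermediate to be stored as a float,
 * so no extra precision is carried between the two additions. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* a + (b + c): equal to sum_left over the reals, but not necessarily
 * in floating-point arithmetic. */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With a = 1, b = 10^8 and c = −10^8, the exact result is 1. `sum_left` returns 0, because 1 is absorbed when added to 10^8, whereas `sum_right` returns 1, because the two large terms cancel exactly first.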
Secondly, by the same paradigm shift, a wide range of optimization opportunities can
be explored to improve throughput and reduce resource utilization. These opportunities were
previously forgone because of the necessity of ensuring consistent behaviour.
Finally, optimization can be carried out by applying steps of equivalence rewrites driven
by a prediction model. Traditional optimization passes can be broken up into much
smaller common parts made of equivalence rules which can easily be proved math-
ematically correct. By using models to predict run time, resources and accuracy to
guide the optimization process, it is possible to explore multiple designs that trade off
the three performance metrics while removing concerns about the ordering problem.
Many optimization passes, such as constant propagation, dead code removal, common
subexpression elimination, etc., are naturally subsumed by the new approach. As the
computational power of machines increases exponentially, we can foresee an increase in
the scale of the vast search space to be explored in the future.
This thesis therefore broadens the horizon of HLS tools, and equips them with the new
program optimization paradigm by leveraging the above observations. Specifically, the
trade-off among numerical accuracy, resource utilization and throughput is optimized in
floating-point numerical programs for HLS. Here we summarize the contributions of this
thesis.
To the best of our knowledge, this thesis is the first to introduce multiple-objective
performance optimization in a unified framework for discovering equivalence in programs.
Chapter 3 implements this framework and optimizes a suite of expressions that are difficult
to optimize by hand, and improves numerical accuracy and area automatically. In the
experimental results, it turns out that the two central goals, i.e. improving accuracy
and minimizing area, are often not in conflict, as optimized expressions can enjoy
enhancements that can be achieved in both metrics. Guided by the concept of abstract
interpretation, we further introduce the semantics-based program analyses to jointly
reason about safe ranges of round-off errors and resource utilization, and subsequently,
discovery of equivalent expressions. This technique lays the necessary foundation for
program equivalence beyond simple arithmetic expressions.
The infinite size of the equivalent program space, coupled with undecidability of program
properties, makes the program optimization an even more challenging task than the one
of arithmetic expressions. For this, Chapter 4 introduces a new graph-based intermediate
representation, MIR, for capturing the semantics of numerical programs. This approach
reduces the size of the search space, and the IR itself is derived from the formal seman-
tics of programs to ensure the correctness of equivalent MIRs and the back-and-forth
translation between C and MIR. This further eliminates the problem of optimization pass
ordering, because by using the equivalence discovery framework, the Pareto frontier can
be extended incrementally with small steps of rewrites to multiple candidates. Traditional
compiler optimizations are naturally subsumed and further enhanced by the MIRs, as
many optimization techniques such as loop splitting and loop fusion that previously must
be profiled to justify enabling them, can emerge automatically from the optimization pro-
cess. By optimizing a suite of resource-efficient benchmark examples, the tool improves
the numerical accuracy by up to 65%.
Formerly, the ability of HLS tools to pipeline loops was fundamentally constrained by
inter-iteration dependencies. Traditional optimization techniques such as partial loop unrolling
may have minimal effects on the initiation interval of pipelined loops, as these do
not impact the data-path structure, which ensures that the functional equivalence is
preserved. Encouraged by the promising effects of Nicolau et al.’s tree height reduction
technique [NP91] and LegUp’s recurrence minimization [Can+14], Chapter 5 further
incorporates latency analysis into the unified program optimization framework. It was
found that traditional optimization techniques when used in tandem with the arithmetic
equivalence rules and memory access reduction rules can significantly improve the latency
and accuracy of a numerical program. In Chapter 4, the experimental results identify that
the static analysis of round-off errors for each candidate explored is the key factor for the
speed of optimization. Chapter 5 addresses this problem with graph partitioning
and candidate pruning algorithms. This further enables deeper partial loop unrolling
factors not explored in Chapter 4. Often as we optimize numerical programs by allowing
greater resource budgets, latency and round-off error can be simultaneously minimized,
as more resources would allow greater flexibility to discover equivalent programs that
often perform well in terms of run time and accuracy. Additionally, this process is
simultaneously driven by resource minimization, allowing the area-latency product to
decrease as we explore increasingly deeper partial unrolling. By optimizing a suite of
benchmark examples from PolyBench and Livermore loops, the tool improves the latency
and accuracy of each by up to 12× and 7× respectively, at a cost of 4× more resource
utilization.
6.1 Future Prospects
In its current form, the new approach to program optimization explained in this thesis
forms the underlying basis for a much larger set of future work. Even though it is
preliminary on its own, the promising experimental results showcase the powerful optimization
it can bring to optimizing compilers and HLS tools. Here, a list of potential directions
of future research is discussed that could further widen the scope of our technique for a
broader range of applications.
LLIR-Level Program Optimization. We could envision a back-and-forth translator from
LLIR [LA; LLIR] to MIR graphs. This could enable a much wider applicability of the
techniques presented in the thesis to both LLVM-based HLS tools and software compilers.
Additionally, it could benefit from existing LLVM optimization passes by using the
optimized LLIR code as inputs. There are, however, obstacles in migrating to LLIR as
the source language. Firstly, LLIR is SSA-based. Since it uses temporary variables for
intermediate results in computation, a full liveness analysis [CT11; Nie+99; Boi+08]
may be necessary to eliminate temporary variables from the resulting MIR. Secondly,
control flow in LLIR is more freely structured. Unlike C, which defines if statements
and while loops and discourages the use of goto statements, control flow in LLIR is
composed of basic blocks and branches between pairs of them. This requires the MIR to
be further extended to cope with complex control-flow patterns. Conventionally, programs
written with branches are often analyzed using continuation-style semantics [Fel+88]. It
is not evident how this semantics can be embedded within MIRs.
Tighter bounds on round-off errors. As an alternative to interval analysis, the accuracy
analysis could enjoy more sophisticated abstract domains that capture the correlations
between variables, and produce tighter bounds for results. Currently, the analysis cannot
produce meaningful, i.e. finite, bounds on the round-off errors of certain numerical
programs. If the analysis fails to bound errors, then currently the optimization cannot be
directed to a more accurate implementation. By using abstract relational domains, it is
possible to produce a much tighter bound on the values of program variables, and the
associated errors. There are a few static analysis techniques for floating-point errors
based on relational domains [Min07b; Put+04; GP11; Ast]; however, making use of them still
poses challenges. Each floating-point operation introduces an independent error term
as a new variable in the formulation of these relational domains, and it may be difficult
to determine how to collapse these error terms into a smaller set of variables, as the
optimization in this thesis can introduce a large number of error variables.
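The loss of correlation in interval analysis can be seen in a very small example. The sketch below subtracts an interval from itself; the `interval` type and function names are hypothetical, and outward rounding of the bounds is omitted for brevity.

```c
#include <assert.h>

/* Closed interval [lo, hi]. */
typedef struct { double lo, hi; } interval;

/* Interval subtraction: both operands are treated as unrelated, so the
 * result covers every difference of a value in a and a value in b. */
interval interval_sub(interval a, interval b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    interval r = { a.lo - b.hi, a.hi - b.lo };
    return r;
}

/* Width of an interval, a simple measure of how loose the bound is. */
double interval_width(interval a)
{
    return a.hi - a.lo;
}
```

For x in [0, 1], the expression x − x is exactly 0, yet `interval_sub(x, x)` returns [−1, 1] with width 2. A relational domain that records that both operands are the same variable could return the exact result instead.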
Special and fused operators. The HLS community may be interested in
how SOAP can be combined with existing work on fused floating-point data-path
synthesis. Langhammer et al. [LV09] propose that the normalization and denormalization
stages could be regarded as redundant between operators in a floating-point data-path.
By removing these stages, subsets of the data-path become fixed-point data-paths,
thereby saving resources and improving throughput at a cost of accuracy. It could
be compelling to isolate the normalization/denormalization stages into operators in the
SOAP framework, so that a mixed floating-point/fixed-point program can more efficiently
trade off resources, accuracy and latency.
Multiple word-lengths. In this thesis, experiments have been carried out on floating-
point operations with a fixed mantissa width only. It would be beneficial to further
integrate fixed-point support. Additionally, by further supporting multiple precisions
in the data-path, i.e. allowing each operator to compute with different precisions, the
trade-off relationship among our three primary performance measures can be even more
effective. Techniques, known as multiple word-length optimization [Con+11b; Lee+06;
Can+02], exist to apply a heuristic approach to perturb the precisions in a data-path, so
that a performance metric can be optimized while the round-off errors of outputs satisfy
an error budget. Instituting such techniques in the SOAP framework would be rewarding, as
they can further reduce the area and latency requirements of a synthesized circuit for a given
accuracy. However, all of these approaches optimize a fixed data-flow graph, whereas in SOAP the
structure of the data- and control-paths varies as we optimize them. Analyzing each of
the candidates for an optimal precision assignment to each operator is very inefficient
because of the number of candidates explored. Moreover, current techniques work with a
predetermined error budget, whereas in fact a Pareto frontier exists for each data-path to
trade off accuracy, resources and latency.
Numerical analysis and linear algebra. There are two distinct approaches to the
analysis of round-off errors. One focuses on the round-off errors by statically analyzing
numerical programs, and applies this in a way which is as general as possible, similar to
the method presented in this thesis. On the other hand, there are techniques employed
by numerical analysts to evaluate and improve the numerical accuracy and stability of
particular algorithms analytically. Many creative solutions to challenges are invented
in this process. For instance, Kahan's compensated summation algorithm is an accurate
way to compute a sum of n values, $\sum_{i=0}^{n-1} x_i$ [Kah65], which is shown in Figure 6.1. This
algorithm cannot be discovered easily using the method outlined in this thesis, and a way
to extend the framework to optimize programs as creatively as humans still eludes us
at the moment. Higham [Hig02] discusses in great depth many existing numerical
accuracy problems encountered in finite-precision computation of polynomials and linear
algebra subprograms, and how to analyze and overcome inaccuracies, often in terms of
relative errors. Bridging the gap between computational and mathematical approaches
for numerical analysis will allow us to automate many accuracy optimizations that were
previously unexplored by the tool.
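float compensated_summation(float X[N])
{
    float sum = 0.0f;
    float e = 0.0f;
    for (int i = 0; i < N; i++)
    {
        float tmp = sum;
        float y = X[i] + e;
        sum = tmp + y;
        /* recover the round-off error of the addition above */
        e = (tmp - sum) + y;
    }
    return sum;
}
Figure 6.1. Kahan's compensated summation algorithm.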
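The benefit of compensation can be checked with a self-contained comparison against plain summation. The sketch below assumes IEEE 754 single precision and a compiler that does not reassociate floating-point operations (for example, no fast-math style flags, which could remove the compensation term); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Plain left-to-right summation in single precision. */
float naive_sum(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Kahan's compensated summation: e carries the round-off error that
 * the previous addition failed to absorb into sum. */
float kahan_sum(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    float e = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float tmp = sum;
        float y = x[i] + e;
        sum = tmp + y;
        e = (tmp - sum) + y;
    }
    return sum;
}
```

A simple experiment is to sum many copies of 0.1f with both functions and compare each result against a double-precision reference: the compensated sum is expected to stay close to the reference, while the error of the plain sum grows with the number of terms.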
Continuity analysis and optimization. The robustness of programs is very important
to us. In many cases, we wish our algorithms to be free from discontinuity, i.e. a small
change in the initial condition would not result in an undesirably large jump in the
outputs. For this, Chaudhuri et al. [Cha+11] and Goubault et al. [GP13] respectively
propose methods to analyze the robustness of programs. The former approach formally
proves whether an algorithm is ill-conditioned in terms of the existence of discontinuity,
whereas the latter statically analyzes programs to determine whether round-off errors
introduce significant discontinuous behaviour. To illustrate, consider an if branch,
“if (e > 0) c1 else c2”, where e is a floating-point expression. If e is positive and
very close to 0 when evaluated in real arithmetic, the floating-point result of e could be
non-positive, due to the effect of the round-off errors. In these extraordinary cases, the c2
branch may be executed instead of the intended c1. These techniques could
inspire us to optimize such discontinuous behaviour, as illustrated by this example,
as another objective.
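A concrete instance of such a branch flip is sketched below, assuming IEEE 754 arithmetic with round-to-nearest. The expression e = (1 + x) − 1 and the function names are chosen for illustration only; double precision stands in for real arithmetic.

```c
/*
 * Evaluates e = (1 + x) - 1 in single precision and branches on e > 0.
 * Returns 1 if the "then" branch (c1) is taken and 2 if the "else"
 * branch (c2) is taken. The volatile stores force rounding to float.
 */
int branch_taken_float(float x)
{
    volatile float t = 1.0f + x;
    volatile float e = t - 1.0f;
    if (e > 0.0f)
        return 1;
    return 2;
}

/* The same expression in double precision, used here as a stand-in
 * for evaluation in real arithmetic. */
int branch_taken_reference(double x)
{
    volatile double t = 1.0 + x;
    volatile double e = t - 1.0;
    if (e > 0.0)
        return 1;
    return 2;
}
```

For x = 10^−8, the real value of e is positive, so c1 is intended, and `branch_taken_reference(1e-8)` returns 1. In single precision, x is smaller than half a unit in the last place of 1, so 1 + x rounds to 1, e becomes 0, and `branch_taken_float(1e-8f)` returns 2.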
Memory partitioning. The experimental results in this work see a diminishing
performance return when loops are deeply unrolled, because of a memory bottleneck. Once
memory accesses saturate in loop execution, i.e. all memory ports are working at 100%
utilization, it is impossible to gain further performance improvements. Currently, the tool
stops exploring further loop unrolling when this happens. By automatically partitioning
arrays upon hitting such a memory bottleneck, further throughput improvements can be
achieved.
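The effect of partitioning on the memory bottleneck can be estimated with a simple port-counting bound. The sketch below assumes that all accesses in one iteration target a single array, that the array is split into equally sized banks, and that accesses spread evenly across the banks; the function name `memory_mii` and the numbers in the usage note are hypothetical.

```c
#include <assert.h>

/*
 * Lower bound on the initiation interval imposed by memory ports.
 * Each bank can serve at most ports_per_bank accesses per cycle, so an
 * iteration with accesses_per_iteration accesses needs at least
 * ceil(accesses / (ports_per_bank * banks)) cycles.
 */
int memory_mii(int accesses_per_iteration, int ports_per_bank, int banks)
{
    assert(accesses_per_iteration >= 0 && ports_per_bank > 0 && banks > 0);
    int ports = ports_per_bank * banks;
    int mii = (accesses_per_iteration + ports - 1) / ports;
    return mii > 0 ? mii : 1; /* an II below one cycle is not meaningful */
}
```

With a single dual-port RAM, an unrolled loop body that performs 8 accesses per iteration cannot start iterations more often than every 4 cycles; splitting the array into 4 banks would lift this bound to 1 cycle, provided the accesses do not conflict on the same bank.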
Integer programs. SOAP can optimize programs written with integer operations, but
with notable limitations. Firstly, accuracy analysis does not provide useful information,
because integer operations do not have round-off errors. Secondly, the scheduling analysis
in Chapter 5 can no longer accurately predict the run time performance of the resulting
circuit, because HLS tools employ operator chaining to schedule multiple fast dependent
operations within one clock cycle, whereas currently our analysis cannot estimate the
impact of this technique. Finally, resource estimation could be much less accurate, because
optimizations performed during RTL synthesis allow a single array of LUTs to be used to implement
multiple arithmetic/logic operations. It is notable that because floating-point operations
require multiple cycles to complete and their circuits are compactly designed, the latter
two limitations apply specifically to integer programs, and have negligible impact on
floating-point programs.
Other practical considerations. Finally, we consider limitations which, if overcome,
could make the resulting tool much more usable. Firstly, SOAP does not scale well
for programs larger than a few loops, so the user must still manually partition
the program into smaller snippets to be optimized individually, and tune parameters
to trade quality for optimization speed. Another limitation is that programs cannot be
optimized without knowledge about the input variables. By contrast, Herbie [Pan+15] makes no
assumption about the input space, and can nevertheless optimize arithmetic expressions
by splitting the input space into multiple parts to be evaluated by expressions that are
optimized for different regions.
6.2 Final Remarks
This thesis adapts existing techniques such as accuracy, latency and resource usage
analysis, and further introduces novel approaches, e.g. MIR and efficient equivalence
discovery, and delivers them in a unified framework. The functional equivalence re-
laxation paradigm is relatively under-explored, because these optimizations are often
highlighted as unsafe by the HLS tools, as they cannot analyze the numerical implica-
tions of these optimizations. HLS tools therefore have very limited optimization options
based on this particular concept. With the constructive results produced by this thesis,
optimizations based on our concept can not only raise performance measures, but also
result in even safer implementations as we improve numerical accuracies. The equiva-
lence discovery algorithm in tandem with MIRs could have great potential in compiler
optimization based on our concept. Furthermore, since machine learning algorithms are
error-resilient [Les+11; Kim+09; HB91; ZS03], the methods demonstrated in this thesis
have promising capabilities to improve their resource usage, latency and accuracy.

Bibliography
[AJ94] Samson Abramsky and Achim Jung. Domain Theory. Ed. by S. Abramsky, Dov M.
Gabbay, and T. S. E. Maibaum. Vol. 3. Oxford: Clarendon Press, 1994 (cit. on p. 61).
[Alm+04] L. Almagor, Keith D. Cooper, Alexander Grosul, et al. “Finding Effective Compilation
Sequences”. In: Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on
Languages, Compilers, and Tools for Embedded Systems. LCTES ’04. Washington, DC, USA:
ACM, 2004, pp. 231–239. DOI: 10.1145/997163.997196 (cit. on p. 68).
[Alp+88] Bowen Alpern, Mark N. Wegman, and F. Kenneth Zadeck. “Detecting Equality of
Variables in Programs”. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages. POPL ’88. San Diego, California, USA: ACM,
1988, pp. 1–11. DOI: 10.1145/73560.73561 (cit. on pp. 64, 67).
[Alt10] Altera Corporation. Introduction to the Quartus II Software. 2010. URL:
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/manual/intro_to_quartus2.pdf
(cit. on pp. 25, 31, 143).
[Alt13] Altera Corporation. Floating-Point Megafunctions User Guide. Nov. 2013. URL:
http://www.altera.co.uk/literature/ug/ug_altfp_mfug.pdf (cit. on p. 136).




[Alt15b] Altera Corporation. Stratix 10 Product Table. 2015. URL: https://www.altera.
com/content/dam/altera-www/global/en_US/pdfs/literature/pt/
stratix-10-product-table.pdf (cit. on p. 29).
[Alt15c] Altera Corporation. Stratix 10: The Most Powerful, Most Efficient FPGA for Signal
Processing. 2015. URL: https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/backgrounder/stratix10-floating-point-backgrounder.pdf
(cit. on pp. 17, 30).
[Alt15d] Altera Corporation. Stratix V Device Handbook Volume 1: Device Interfaces and Integra-
tion. Dec. 2015. URL: https://www.altera.com/en_US/pdfs/literature/
hb/stratix-v/stx5_core.pdf (cit. on pp. 28–30).
[Alt16] Altera Corporation. Stratix IV Device Handbook Volume 1. Jan. 2016. URL: https:
//www.altera.com/content/dam/altera-www/global/en_US/pdfs/
literature/hb/stratix-iv/stratix4_handbook.pdf (cit. on p. 143).
[AM99] Micah Altman and Michael McDonald. “The Robustness of Statistical Abstractions:
A Look “Under the Hood” of Statistical Models and Software”. In: (1999) (cit. on
p. 77).
[ANS08] ANSI/IEEE. IEEE Standard for Floating-Point Arithmetic. Tech. rep. Microprocessor
Standards Committee of the IEEE Computer Society, Aug. 2008, pp. 1–58. DOI:
10.1109/ieeestd.2008.4610935 (cit. on pp. 17, 30, 61, 87).
[Apr] ACI "Sécurité & Informatique". Project APRON. URL: http://apron.cri.ensmp.
fr (cit. on p. 60).
[Ast] Patrick Cousot, Radhia Cousot, Jérôme Feret, Antoine Miné, Xavier Rival, et al. The
Astrée Static Analyzer. URL: http://www.astree.ens.fr (cit. on pp. 47, 181).
[Bac+13] David Bacon, Rodric Rabbah, and Sunil Shukla. “FPGA Programming for the Masses”.
In: Queue 11.2 (Feb. 2013), 40:40–40:52. DOI: 10.1145/2436696.2443836
(cit. on pp. 27, 28).
[Bac78] John Backus. “Can Programming Be Liberated from the Von Neumann Style? A
Functional Style and Its Algebra of Programs”. In: Commun. ACM 21.8 (Aug. 1978),
pp. 613–641. DOI: 10.1145/359576.359579 (cit. on p. 27).
[Bar13] H.P. Barendregt. The Lambda Calculus: Its Syntax and Semantics. Studies in Logic and
the Foundations of Mathematics. Elsevier Science, 2013 (cit. on p. 124).
[BDM99] B. D. McCullough and H. D. Vinod. “The Numerical Reliability of Econometric Software”.
In: Journal of Economic Literature 37.2 (1999), pp. 633–665. URL: http://www.
jstor.org/stable/2565215 (cit. on p. 77).
[BDT10a] Berkeley Design Technology, Inc. An Independent Evaluation of: High-Level Synthesis
Tools for Xilinx FPGAs. 2010. URL: http://www.xilinx.com/technology/
dsp/BDTI_techpaper.pdf (cit. on p. 17).
[BDT10b] Berkeley Design Technology, Inc. High-Level Synthesis Tools for Xilinx FPGAs. Tech. rep.
2010 (cit. on pp. 32, 33).
[Bet08] Vaughn Betz. “Placement for General-Purpose FPGAs”. In: Reconfigurable Computing:
The Theory and Practice of FPGA-Based Computation. Ed. by Scott Hauck and André
DeHon. Morgan Kaufmann, 2008, pp. 299–318 (cit. on p. 31).
[Boi+08] Benoit Boissinot, Sebastian Hack, Daniel Grund, Benoît Dupont de Dinechin, and
Fabrice Rastello. “Fast Liveness Checking for SSA-Form Programs”. In: Proceedings of
the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimiza-
tion. CGO ’08. Boston, MA, USA: ACM, 2008, pp. 35–44. DOI: 10.1145/1356058.
1356064 (cit. on p. 181).
[Bou+09] Olivier Bouissou, Eric Conquet, Patrick Cousot, et al. “Space Software Validation using
Abstract Interpretation”. In: Proceedings of the International Space System Engineering
Conference, Data Systems in Aerospace (DASIA 2009). Vol. SP-669. Istanbul, Turkey:
ESA, May 2009, pp. 1–7 (cit. on p. 47).
[Bri07] Robert Bridson. “Fast Poisson Disk Sampling in Arbitrary Dimensions”. In: SIGGRAPH
Sketches. 2007 (cit. on p. 160).
[Bro+10] Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and
Olaf O. Storaasli. “State-of-the-art in heterogeneous computing”. In: Scientific Pro-
gramming 18. 2010 (cit. on p. 28).
[Can+02] M. A. Cantin, Y. Savaria, and P. Lavoie. “A comparison of automatic word length
optimization procedures”. In: IEEE International Symposium on Circuits and Sys-
tems. Vol. 2. ISCAS 2002. 2002, pp. II–612–II–615. DOI: 10.1109/ISCAS.2002.
1011427 (cit. on p. 182).
[Can+13] Andrew Canis, Jongsok Choi, Mark Aldham, et al. “LegUp: An Open-source High-level
Synthesis Tool for FPGA-based Processor/Accelerator Systems”. In: ACM Transactions
on Embedded Computing Systems 13.2 (Sept. 2013), 24:1–24:27. DOI: 10.1145/
2514740 (cit. on pp. 33–35, 38).
[Can+14] Andrew Canis, Stephen Dean Brown, and Jason Helge Anderson. “Modulo SDC
Scheduling with Recurrence Minimization in High-Level Synthesis”. In: 2014 24th
International Conference on Field Programmable Logic and Applications (FPL). Sept.
2014, pp. 1–8. DOI: 10.1109/FPL.2014.6927490 (cit. on pp. 35, 37–39, 43, 75,
149, 150, 164, 180).
[CC04] Patrick Cousot and Radhia Cousot. “Basic Concepts of Abstract Interpretation”. In:
Building the Information Society. Ed. by René Jacquart. Vol. 156. IFIP International
Federation for Information Processing. Springer US, 2004, pp. 359–366. DOI: 10.
1007/978-1-4020-8157-6_27 (cit. on p. 132).
[CC77] Patrick Cousot and Radhia Cousot. “Abstract interpretation: a unified lattice model
for static analysis of programs by construction or approximation of fixpoints”. In:
Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming
Languages, POPL ’77. ACM, 1977, pp. 238–252. DOI: 10.1145/512950.512973
(cit. on pp. 26, 46, 53, 56, 86).
[CF87] Ron Cytron and Jeanne Ferrante. “What’s in a name?” In: Proceedings of the 1987
International Conference on Parallel Processing. Aug. 1987, pp. 19–27 (cit. on p. 67).
[CH78] Patrick Cousot and Nicolas Halbwachs. “Automatic Discovery of Linear Restraints
Among Variables of a Program”. In: Proceedings of the 5th ACM SIGACT-SIGPLAN
Symposium on Principles of Programming Languages. POPL ’78. Tucson, Arizona: ACM,
1978, pp. 84–96. DOI: 10.1145/512760.512770 (cit. on p. 60).
[Cha+11] Swarat Chaudhuri, Sumit Gulwani, Roberto Lublinerman, and Sara Navidpour. “Prov-
ing Programs Robust”. In: Proceedings of the 19th ACM SIGSOFT Symposium and the
13th European Conference on Foundations of Software Engineering. ESEC/FSE ’11.
Szeged, Hungary: ACM, 2011, pp. 102–112. DOI: 10.1145/2025113.2025131.
URL: http://doi.acm.org/10.1145/2025113.2025131 (cit. on p. 183).
[CK04] Martine Ceberio and Vladik Kreinovich. “Greedy Algorithms for Optimizing Multivari-
ate Horner Schemes”. In: SIGSAM Bull. 38.1 (Mar. 2004), pp. 8–15. DOI: 10.1145/
980175.980179. URL: http://doi.acm.org/10.1145/980175.980179
(cit. on p. 76).
[Cla16] The Clang Team. Clang 3.9 documentation. 2016. URL: http://clang.llvm.org/
docs/UsersManual.html (cit. on p. 73).
[CM08] Philippe Coussy and Adam Morawiec. High-Level Synthesis: from Algorithm to Digital
Circuit. 1st. Springer-Verlag, 2008 (cit. on pp. 32, 33).
[Con+11a] Jason Cong, Wei Jiang, Bin Liu, and Yi Zou. “Automatic Memory Partitioning and
Scheduling for Throughput and Power Optimization”. In: ACM Trans. Des. Autom.
Electron. Syst. 16.2 (Apr. 2011), 15:1–15:25. DOI: 10.1145/1929943.1929947
(cit. on p. 45).
[Con+11b] G.A. Constantinides, A.B. Kinsman, and N. Nicolici. “Numerical Data Representations
for FPGA-Based Scientific Computing”. In: IEEE Design and Test of Computers 28.4
(2011), pp. 8–17. DOI: 10.1109/MDT.2011.48 (cit. on p. 182).
[Con+12] Jason Cong, Peng Zhang, and Yi Zou. “Optimizing Memory Hierarchy Allocation
with Loop Transformations for High-level Synthesis”. In: Proceedings of the 49th
Annual Design Automation Conference. DAC ’12. San Francisco, California: ACM,
2012, pp. 1233–1238. DOI: 10.1145/2228360.2228586 (cit. on p. 45).
[Con+13] Jason Cong, Bin Liu, Raghu Prabhakar, and Peng Zhang. “A Study on the Impact of
Compiler Optimizations on High-Level Synthesis”. In: 25th International Workshop
on Languages and Compilers for Parallel Computing. Ed. by Hironori Kasahara and
Keiji Kimura. Springer Berlin Heidelberg, 2013, pp. 143–157. DOI: 10.1007/978-
3-642-37658-0_10 (cit. on p. 178).
[CP08] Jason Cong and Peichen Pan. “Technology Mapping”. In: Reconfigurable Computing:
The Theory and Practice of FPGA-Based Computation. Ed. by Scott Hauck and André
DeHon. Morgan Kaufmann, 2008, pp. 277–296 (cit. on p. 31).
[CT11] Keith D. Cooper and Linda Torczon. Engineering a Compiler. 2nd. Morgan Kaufmann,
Feb. 2011 (cit. on p. 181).
[Cyt+86] Ron Cytron, Andy Lowry, and F. Kenneth Zadeck. “Code Motion of Control Struc-
tures in High-level Languages”. In: Proceedings of the 13th ACM SIGACT-SIGPLAN
Symposium on Principles of Programming Languages. POPL ’86. St. Petersburg Beach,
Florida: ACM, 1986, pp. 70–85. DOI: 10.1145/512644.512651 (cit. on p. 64).
[Cyt+91] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth
Zadeck. “Efficiently Computing Static Single Assignment Form and the Control
Dependence Graph”. In: ACM TOPLAS 13.4 (Oct. 1991), pp. 451–490 (cit. on pp. 64,
67).
[CZ06] Jason Cong and Zhiru Zhang. “An efficient and versatile scheduling algorithm based
on SDC formulation”. In: 2006 43rd ACM/IEEE Design Automation Conference. June
2006, pp. 433–438. DOI: 10.1109/DAC.2006.229228 (cit. on pp. 35, 42).
[Dam+14] Nasrine Damouche, Matthieu Martel, and Alexandre Chapoutot. “Transformation of a
PID Controller for Numerical Accuracy”. In: 7th International Workshop on Numerical
Software Verification. NSV ’14. July 2014 (cit. on p. 143).
[Dam+15] Nasrine Damouche, Matthieu Martel, and Alexandre Chapoutot. “Intra-procedural
Optimization of the Numerical Accuracy of Programs”. In: Formal Methods for In-
dustrial Critical Systems. Ed. by Manuel Núñez and Matthias Güdemann. Vol. 9128.
LNCS. Springer, 2015, pp. 31–46 (cit. on pp. 84, 115).
[Dar+13] Eva Darulova, Viktor Kuncak, Rupak Majumdar, and Indranil Saha. On the Genera-
tion of Precise Fixed-Point Expressions. Tech. rep. École Polytechnique Fédérale De
Lausanne, 2013 (cit. on p. 77).
[Dem11] James Demmel. “Accurate and efficient expression evaluation and linear algebra”. In:
Proceedings of the 2011 International Workshop on Symbolic-Numeric Computation,
SNC ’11. ACM, 2011, p. 2. DOI: 10.1145/2331684.2331686 (cit. on p. 107).
[DL11] Jack Dongarra and Piotr Luszczek. “Livermore Loops”. In: Encyclopedia of Parallel
Computing. Springer US, 2011, pp. 1041–1043. DOI: 10.1007/978-0-387-09766-4_161
(cit. on pp. 150, 152, 170, 175).
[DMB08] Leonardo De Moura and Nikolaj Bjørner. “Z3: An Efficient SMT Solver”. In: Proceedings
of the Theory and Practice of Software, 14th International Conference on Tools
and Algorithms for the Construction and Analysis of Systems. TACAS’08/ETAPS’08.
Budapest, Hungary: Springer-Verlag, 2008, pp. 337–340. DOI: 10.1007/978-3-
540-78800-3_24 (cit. on p. 46).
[Dow97] Mark Dowson. “The Ariane 5 Software Failure”. In: ACM SIGSOFT Software En-
gineering Notes 22.2 (Mar. 1997), p. 84. DOI: 10.1145/251880.251992. URL:
http://doi.acm.org/10.1145/251880.251992 (cit. on p. 46).
[DP11] F. de Dinechin and B. Pasca. “Designing Custom Arithmetic Data Paths with FloPoCo”.
In: IEEE Design and Test of Computers 28.4 (2011), pp. 18–27. DOI: 10.1109/MDT.
2011.44 (cit. on p. 105).
[ECL] BUGSENG srl. ECLAIR: a general platform for software verification. URL: http://
bugseng.com/products/eclair (cit. on p. 47).
[Fan+08] Kevin Fan, Hyunchul Park, Manjunath Kudlur, and Scott Mahlke. “Modulo Scheduling
for Highly Customized Datapaths to Increase Hardware Reusability”. In: Proceedings of
the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimiza-
tion. CGO ’08. Boston, MA, USA: ACM, 2008, pp. 124–133. DOI: 10.1145/1356058.
1356075. URL: http://doi.acm.org/10.1145/1356058.1356075 (cit. on
p. 38).
[Fel+88] Matthias Felleisen, Mitch Wand, Daniel Friedman, and Bruce Duba. “Abstract Contin-
uations: A Mathematical Semantics for Handling Full Jumps”. In: Proceedings of the
1988 ACM Conference on LISP and Functional Programming. LFP ’88. Snowbird, Utah,
USA: ACM, 1988, pp. 52–62. DOI: 10.1145/62678.62684 (cit. on p. 181).
[Flo62] Robert W. Floyd. “Algorithm 97: Shortest Path”. In: Communications of the ACM 5.6
(June 1962), pp. 345–. DOI: 10.1145/367766.368168. URL: http://doi.acm.
org/10.1145/367766.368168 (cit. on pp. 41, 59, 166).
[Flu] Fluctuat: an abstract-interpretation based static analyzer of numerical programs. URL:
http://www.lix.polytechnique.fr/Labo/Sylvie.Putot/fluctuat.
html (cit. on p. 47).
[Fou+07] Laurent Fousse, Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, and Paul Zim-
mermann. “MPFR: A multiple-precision binary floating-point library with correct
rounding”. In: ACM Transactions on Mathematical Software (TOMS) 33.2 (2007),
p. 13 (cit. on p. 104).
[Gaj+92] Daniel Gajski, Nikil Dutt, Allen Wu, and Steve Lin. High-Level Synthesis—Introduction
to Chip and System Design. Kluwer Boston, 1992 (cit. on pp. 17, 25, 32).
[Gao+13] Xitong Gao, Samuel Bayliss, and George A. Constantinides. “SOAP: Structural Optimization
of Arithmetic Expressions for High-Level Synthesis”. In: Proceedings of the
2013 International Conference on Field-Programmable Technology. FPT ’13. Dec. 2013,
pp. 112–119. DOI: 10.1109/FPT.2013.6718340 (cit. on pp. 23, 24).
[Gao+16] Xitong Gao, John Wickerson, and George A. Constantinides. “Automatically Opti-
mizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis”. In:
Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. FPGA ’16. Monterey, California, USA: ACM, 2016, pp. 234–243. DOI:
10.1145/2847263.2847282 (cit. on pp. 23, 24).
[GC15] Xitong Gao and George A. Constantinides. “Numerical Program Optimization for
High-Level Synthesis”. In: Proceedings of the 2015 ACM/SIGDA International Sympo-
sium on Field-Programmable Gate Arrays. FPGA ’15. Monterey, California, USA: ACM,
2015, pp. 210–213. DOI: 10.1145/2684746.2689090 (cit. on pp. 23, 24).
[Gho+09] Khalil Ghorbal, Eric Goubault, and Sylvie Putot. “The Zonotope Abstract Domain
Taylor1+”. In: Proceedings of the 21st International Conference on Computer Aided
Verification. Ed. by Ahmed Bouajjani and Oded Maler. Springer Berlin Heidelberg,
2009, pp. 627–633. DOI: 10.1007/978-3-642-02658-4_47 (cit. on p. 60).
[Gol91] David Goldberg. “What every computer scientist should know about floating-point
arithmetic”. In: ACM Computing Surveys (CSUR) 23.1 (1991), pp. 5–48 (cit. on pp. 18,
88).
[GP11] Eric Goubault and Sylvie Putot. “Static Analysis of Finite Precision Computations”. In:
Proceedings of the 12th International Conference on Verification, Model Checking, and
Abstract Interpretation. VMCAI’11. Austin, TX, USA: Springer-Verlag, 2011, pp. 232–
247. URL: http://dl.acm.org/citation.cfm?id=1946284.1946301
(cit. on p. 181).
[GP13] Eric Goubault and Sylvie Putot. “Robustness Analysis of Finite Precision Implemen-
tations”. In: Proceedings of the 11th Asian Symposium on Programming Languages
and Systems - Volume 8301. Springer-Verlag New York, Inc., 2013, pp. 50–57. DOI:
10.1007/978-3-319-03542-0_4 (cit. on p. 183).
[GR94] Daniel D. Gajski and Loganath Ramachandran. “Introduction to High-Level Synthesis”.
In: IEEE Design and Test of Computers 11.4 (Oct. 1994), pp. 44–54 (cit. on p. 17).
[Gra+91] Torbjörn Granlund et al. GMP, the GNU multiple precision arithmetic library. 1991.
URL: http://gmplib.org/ (cit. on p. 104).
[Gra89] Philippe Granger. “Static analysis of arithmetical congruences”. In: International
Journal of Computer Mathematics 30.3-4 (1989), pp. 165–190 (cit. on p. 60).
[Guc08] Steven A. Guccione. “Configuration Bitstream Generation”. In: Reconfigurable Com-
puting: The Theory and Practice of FPGA-Based Computation. Ed. by Scott Hauck and
André DeHon. Morgan Kaufmann, 2008, pp. 401–410 (cit. on p. 32).
[Gup+04] Sumit Gupta, Rajesh Gupta, Nikil D. Dutt, and Alexandru Nicolau. SPARK: A Paral-
lelizing Approach to the High-Level Synthesis of Digital Circuits. 1st. Springer US, 2004,
pp. 10–1–10–20. DOI: 10.1007/b117058 (cit. on pp. 44, 68).
[Ham86] Richard W Hamming. Numerical Methods for Scientists and Engineers. 2nd. New York,
NY, USA: Dover Publications, Inc., 1986 (cit. on p. 83).
[HB91] J. L. Holt and T. E. Baker. “Back propagation simulations using limited precision
calculations”. In: International Joint Conference on Neural Networks. Vol. 2. July 1991,
pp. 121–126. DOI: 10.1109/IJCNN.1991.155324 (cit. on p. 185).
[Hig02] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. 2nd. Society for
Industrial and Applied Mathematics, 2002 (cit. on pp. 18, 183).
[Hos+04] A. Hosangadi, F. Fallah, and R. Kastner. “Factoring and eliminating common subex-
pressions in polynomial expressions”. In: Proceedings of the 2004 IEEE/ACM Interna-
tional Conference on Computer-aided Design, ICCAD ’04. IEEE Computer Society, 2004,
pp. 169–174. DOI: 10.1109/ICCAD.2004.1382566 (cit. on p. 76).
[HP11] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition: A
Quantitative Approach. 5th. Morgan Kaufmann Publishers Inc., 2011 (cit. on p. 27).
[Hua+15] Qijing Huang, Ruolong Lian, Andrew Canis, et al. “The Effect of Compiler Opti-
mizations on High-Level Synthesis-Generated Hardware”. In: ACM Transactions on
Reconfigurable Technology and Systems 8.3 (May 2015), 14:1–14:26. DOI: 10.1145/
2629547 (cit. on p. 178).
[Hwa+91] C. T. Hwang, J. H. Lee, and Y. C. Hsu. “A formal approach to the scheduling problem
in high level synthesis”. In: IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 10.4 (Apr. 1991), pp. 464–475. DOI: 10.1109/43.75629
(cit. on pp. 37, 43).
[IEE06] IEEE. “IEEE Standard for Verilog Hardware Description Language”. In: IEEE Std
1364-2005 (Revision of IEEE Std 1364-2001) (Apr. 2006), pp. 1–560. DOI: 10.1109/
IEEESTD.2006.99495. URL: http://ieeexplore.ieee.org/servlet/
opac?punumber=10779 (cit. on p. 30).
[IEE09] IEEE. “IEEE Standard VHDL Language Reference Manual”. In: IEEE Std 1076-2008
(Revision of IEEE Std 1076-2002) (Jan. 2009), pp. 1–626. DOI: 10.1109/IEEESTD.
2009.4772740. URL: http://ieeexplore.ieee.org/servlet/opac?
punumber=4772738 (cit. on p. 30).
[IM12] Arnault Ioualalen and Matthieu Martel. “A new abstract domain for the representation
of mathematically equivalent expressions”. In: Proceedings of the 19th International
Conference on Static Analysis, SAS ’12. Springer-Verlag, 2012, pp. 75–93 (cit. on
pp. 20, 77–80, 84–86).
[Imp] Impulse Accelerated Technologies, Inc. URL: http://www.impulsec.com (cit. on
p. 33).
[Kah65] William Kahan. “Pracniques: Further Remarks on Reducing Truncation Errors”. In:
Communications of the ACM 8.1 (Jan. 1965), pp. 40–. DOI: 10.1145/363707.
363723 (cit. on p. 182).
[Kar10] Richard M. Karp. “Reducibility Among Combinatorial Problems”. In: 50 Years of
Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. Ed. by
Michael Jünger, M. Thomas Liebling, Denis Naddef, et al. Springer Berlin Heidelberg,
2010, pp. 219–241. DOI: 10.1007/978-3-540-68279-0_8 (cit. on p. 43).
[KD08] Nachiket Kapre and André DeHon. “Programming FPGA applications in VHDL”. In:
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. Ed.
by Scott Hauck and André DeHon. Morgan Kaufmann, 2008, pp. 129–154 (cit. on
p. 30).
[Kim+09] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun. “A highly scalable Restricted
Boltzmann Machine FPGA implementation”. In: 2009 International Conference on
Field Programmable Logic and Applications. Aug. 2009, pp. 367–372. DOI: 10.1109/
FPL.2009.5272262 (cit. on p. 185).
[Knu64] Donald E. Knuth. “Backus Normal Form vs. Backus Naur Form”. In: Commun. ACM
7.12 (Dec. 1964), pp. 735–736. DOI: 10.1145/355588.365140 (cit. on p. 117).
[Kro+03] Daniel Kroening, Edmund Clarke, and Karen Yorav. “Behavioral Consistency of C
and Verilog Programs Using Bounded Model Checking”. In: Proceedings of DAC 2003.
ACM Press, 2003, pp. 368–371 (cit. on p. 46).
[Kuc+] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. “Dependence Graphs
and Compiler Optimizations”. In: Proceedings of the 8th ACM SIGPLAN-SIGACT Sym-
posium on Principles of Programming Languages. POPL ’81. Williamsburg, Virginia,
pp. 207–218. DOI: 10.1145/567532.567555 (cit. on p. 132).
[Kuh10] Harold W. Kuhn. “The Hungarian Method for the Assignment Problem”. In: 50 Years
of Integer Programming 1958–2008: From the Early Years to the State-of-the-Art. Ed. by
Michael Jünger, M. Thomas Liebling, Denis Naddef, et al. Springer Berlin Heidelberg,
2010, pp. 29–47. DOI: 10.1007/978-3-540-68279-0_2 (cit. on p. 35).
[KW03] Markus Kowarschik and Christian Weiß. “An Overview of Cache Optimization Tech-
niques and Cache-Aware Numerical Algorithms”. In: Algorithms for Memory Hierar-
chies — Advanced Lectures, volume 2625 of Lecture Notes in Computer Science. Springer,
2003, pp. 213–232 (cit. on p. 27).
[LA] Chris Lattner and Vikram Adve. “LLVM: A Compilation Framework for Lifelong
Program Analysis & Transformation”. In: Proceedings of the International Symposium
on Code Generation and Optimization. CGO ’04. Palo Alto, California, p. 75. URL:
http://dl.acm.org/citation.cfm?id=977395.977673 (cit. on pp. 34,
142, 181).
[Lee+06] D. U. Lee, A. A. Gaffar, R. C. C. Cheung, et al. “Accuracy-Guaranteed Bit-Width
Optimization”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems 25.10 (Oct. 2006), pp. 1990–2000. DOI: 10.1109/TCAD.2006.
873887 (cit. on p. 182).
[Leg+10] Julien Legriel, Colas Le Guernic, Scott Cotton, and Oded Maler. “Approximating the
Pareto Front of Multi-criteria Optimization Problems”. In: Proceedings of the 16th
International Conference on Tools and Algorithms for the Construction and Analysis
of Systems. TACAS’10. Paphos, Cyprus: Springer-Verlag, 2010, pp. 69–83. DOI: 10.
1007/978-3-642-12002-2_6 (cit. on p. 98).
[Les+11] Bernd Lesser, Manfred Mücke, and Wilfried N. Gansterer. “Effects of Reduced Preci-
sion on Floating-Point SVM Classification Accuracy”. In: Proceedings of the Interna-
tional Conference on Computational Science, ICCS 2011. Vol. 4. 2011, pp. 508–517.
URL: http://dx.doi.org/10.1016/j.procs.2011.04.053 (cit. on p. 185).
[Li+15] Peng Li, Peng Zhang, Louis-Noel Pouchet, and Jason Cong. “Resource-Aware Through-
put Optimization for High-Level Synthesis”. In: FPGA. 2015 (cit. on p. 168).
[LLIR] The LLVM Development Team. LLVM Language Reference Manual. URL: http://
llvm.org/docs/LangRef.html (cit. on pp. 26, 35, 64, 181).
[LU] University of Toronto. LegUp Documentation—Release 4.0. Oct. 2015. URL: http:
//legup.eecg.utoronto.ca/docs/4.0/legup-4.0-doc.pdf (cit. on
pp. 25, 33–35, 64, 74, 75, 143, 164).
[LV09] Martin Langhammer and Tom VanCourt. “FPGA Floating Point Datapath Compiler”.
In: 17th IEEE Symposium on Field Programmable Custom Computing Machines, FCCM
’09. IEEE. 2009, pp. 259–262 (cit. on p. 181).
[Mar07] Matthieu Martel. “Semantics-Based Transformation of Arithmetic Expressions”. In:
Static Analysis, Lecture Notes in Computer Science. Vol. 4634. Springer-Verlag, 2007,
pp. 298–314. DOI: 10.1007/978-3-540-74061-2_19 (cit. on pp. 61, 77, 85–88,
97, 105).
[Mar09] Matthieu Martel. “Program transformation for numerical precision”. In: Proceedings
of the 2009 ACM SIGPLAN workshop on Partial evaluation and program manipulation,
PEPM ’09. ACM, 2009, pp. 101–110. DOI: 10.1145/1480945.1480960 (cit. on
pp. 103, 115).
[Mar12] Matthieu Martel. “Accurate Evaluation of Arithmetic Expressions (Invited Talk)”. In:
Electron. Notes Theor. Comput. Sci. 287 (Nov. 2012), pp. 3–16. DOI: 10.1016/j.
entcs.2012.09.002 (cit. on p. 79).
[McF+90] Michael C McFarland, Alice C Parker, and Raul Camposano. “The high-level synthesis
of digital systems”. In: Proceedings of the IEEE 78.2 (1990), pp. 301–318 (cit. on
p. 33).
[Mee+12] Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. “An
Overview of Today’s High-Level Synthesis Tools”. In: Design Automation for Embedded
Systems 16.3 (2012), pp. 31–51 (cit. on p. 17).
[MG] Mentor Graphics, Inc. Catapult High-Level Synthesis Platform. URL: https://www.
mentor.com/hls-lp/ (cit. on p. 33).
[Min04] Antoine Miné. “Weakly relational numerical abstract domains”. PhD thesis. Ecole
Polytechnique, 2004. URL: http://www.di.ens.fr/~mine/these/these-
color.pdf (cit. on pp. 46, 59).
[Min07a] Antoine Miné. “A New Numerical Abstract Domain Based on Difference-Bound
Matrices”. In: CoRR abs/cs/0703073 (2007). URL: http://arxiv.org/abs/cs/
0703073 (cit. on p. 59).
[Min07b] Antoine Miné. “Relational Abstract Domains for the Detection of Floating-Point Run-
Time Errors”. In: CoRR abs/cs/0703077 (2007). URL: http://arxiv.org/abs/
cs/0703077 (cit. on p. 181).
[Moo+09] Ramon E. Moore, R. Baker Kearfott, and Michael J. Cloud. Introduction to Interval
Analysis. Society for Industrial and Applied Mathematics, 2009 (cit. on p. 53).
[Mou11] Christophe Mouilleron. “Efficient Computation with Structured Matrices and Arith-
metic Expressions”. PhD thesis. Ecole Normale Supérieure de Lyon-ENS LYON, 2011
(cit. on pp. 20, 78).
[MS03] Cameron McNairy and Don Soltis. “Itanium 2 Processor Microarchitecture”. In: IEEE
Micro 23.2 (2003), pp. 44–55. DOI: 10.1109/MM.2003.1196114 (cit. on p. 38).
[Mul05] Jean-Michel Muller. “On the definition of ulp(x)”. In: Research report, Laboratoire de
l’Informatique du Parallélisme, RR2005-09. Feb. 2005 (cit. on p. 62).
[Naj07] Walid A. Najjar. “Compiling code accelerators for FPGAs”. In: 2007 5th IEEE/ACM/IFIP
International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS). Sept. 2007, pp. 2–2 (cit. on p. 33).
[Nam+04] R. Namballa, N. Ranganathan, and A. Ejnioui. “Control and data flow graph extraction
for high-level synthesis”. In: Proceedings of the IEEE Computer Society Annual
Symposium on VLSI, 2004. Feb. 2004, pp. 187–192. DOI: 10.1109/ISVLSI.2004.
1339528 (cit. on p. 123).
[Nan+16] Razvan Nane, Vlad-Mihai Sima, Christian Pilato, et al. “A Survey and Evaluation of
FPGA High-Level Synthesis Tools”. In: IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems PP.99 (2016), pp. 1–1. DOI: 10.1109/TCAD.2015.
2513673 (cit. on pp. 17, 45).
[Neu01] Arnold Neumaier. Introduction to numerical analysis. Cambridge University Press,
2001 (cit. on p. 76).
[Nie+99] Flemming Nielson, Hanne R. Nielson, and Chris Hankin. Principles of Program Analy-
sis. Springer-Verlag Berlin Heidelberg, 1999. DOI: 10.1007/978-3-662-03811-6
(cit. on pp. 46, 47, 51, 56–58, 181).
[NP91] Alexandru Nicolau and Roni Potasmann. “Incremental Tree Height Reduction for
High Level Synthesis”. In: DAC. 1991 (cit. on pp. 74, 180).
[Off92] General Accounting Office. Patriot Missile Software Problem. Tech. rep. GAO/IMTEC-
92-26. Washington, D.C., USA, Feb. 1992. URL: http://fas.org/spp/starwars/
gao/im92026.htm (cit. on p. 46).
[OG86] Alex Orailoglu and Daniel D. Gajski. “Flow Graph Representation”. In: Proceedings of
the 23rd ACM/IEEE Design Automation Conference. DAC ’86. Las Vegas, Nevada, USA:
IEEE Press, 1986, pp. 503–509. URL: http://dl.acm.org/citation.cfm?id=
318013.318093 (cit. on p. 68).
[Pan+15] Pavel Panchekha, Alex Sanchez-Stern, James R. Wilcox, and Zachary Tatlock. “Au-
tomatically Improving Accuracy for Floating Point Expressions”. In: SIGPLAN Not.
50.6 (June 2015), pp. 1–11. DOI: 10.1145/2813885.2737959. URL: http:
//doi.acm.org/10.1145/2813885.2737959 (cit. on pp. 76, 77, 80, 81, 83,
184).
[PB04] Karen Parnell and Roger Bryner. Comparing and Contrasting FPGA and Microprocessor
System Design and Development. Tech. rep. July 2004. URL: http://www.xilinx.
com/support/documentation/white_papers/wp213.pdf (cit. on p. 28).
[PBF] MathWorks. Polyspace Bug Finder. URL: http://mathworks.com/products/
polyspace-bug-finder/ (cit. on p. 47).
[PDM01] A. Peymandoust and G. De Micheli. “Using symbolic algebra in algorithmic level DSP
synthesis”. In: Design Automation Conference, 2001. Proceedings. 2001, pp. 277–282.
DOI: 10.1109/DAC.2001.156151 (cit. on p. 76).
[Pou] Louis-Noel Pouchet. PolyBench/C—the Polyhedral Benchmark suite. http://web.
cse.ohio-state.edu/~pouchet/software/polybench/ (cit. on pp. 150,
152, 153, 170, 175).
[PPL] BUGSENG srl. The Parma Polyhedra Library (PPL). URL: http://bugseng.com/
products/ppl/ (cit. on p. 60).
[Put+04] Sylvie Putot, Eric Goubault, and Matthieu Martel. “Static Analysis-Based Validation
of Floating-Point Computations”. In: Numerical Software with Result Verification:
International Dagstuhl Seminar (2004). Ed. by René Alt, Andreas Frommer, R. Baker
Kearfott, and Wolfram Luther, pp. 306–313. DOI: 10.1007/978-3-540-24738-
8_18 (cit. on p. 181).
[Put+14] Andrew Putnam, Adrian Caulfield, Eric Chung, et al. “A Reconfigurable Fabric for
Accelerating Large-Scale Datacenter Services”. In: 41st Annual International Sym-
posium on Computer Architecture (ISCA). June 2014. URL: http://research.
microsoft.com/apps/pubs/default.aspx?id=212001 (cit. on p. 28).
[Ram+99] Ganesan Ramalingam, Junehwa Song, Leo Joskowicz, and Raymond E. Miller. “Solving
Systems of Difference Constraints Incrementally”. In: Algorithmica 23.3 (1999),
pp. 261–275. DOI: 10.1007/PL00009261 (cit. on p. 43).
[Rau92] B. Ramakrishna Rau. “Data Flow and Dependence Analysis for Instruction Level
Parallelism”. In: Languages and Compilers for Parallel Computing. Vol. 589. LNCS.
Springer, 1992, pp. 236–250 (cit. on pp. 64, 67).
[Rau94] B. Ramakrishna Rau. “Iterative Modulo Scheduling: An Algorithm for Software
Pipelining Loops”. In: MICRO. 1994 (cit. on pp. 38, 39, 41, 67, 150, 166).
[Ric53] Henry G. Rice. “Classes of Recursively Enumerable Sets and Their Decision Problems”.
In: Transactions of the American Mathematical Society 74.2 (1953), pp. 358–366. URL:
http://www.jstor.org/stable/1990888 (cit. on p. 46).
[Ros+88] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. “Global Value Numbers and Redundant
Computations”. In: Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages. POPL ’88. San Diego, California, USA: ACM,
1988, pp. 12–27. DOI: 10.1145/73560.73562 (cit. on p. 64).
[Sch+02] Robert Schreiber, Shail Aditya, Scott Mahlke, et al. “PICO-NPA: High-Level Synthesis
of Nonprogrammable Hardware Accelerators”. In: Journal of VLSI signal process-
ing systems for signal, image and video technology 31.2 (2002), pp. 127–142. DOI:
10.1023/A:1015341305426. URL: http://dx.doi.org/10.1023/A:
1015341305426 (cit. on pp. 33, 38).
[Sch+14] Eric Schkufza, Rahul Sharma, and Alex Aiken. “Stochastic Optimization of Floating-
point Programs with Tunable Precision”. In: Proceedings of the 35th ACM SIGPLAN
Conference on Programming Language Design and Implementation. PLDI ’14. Edin-
burgh, United Kingdom: ACM, 2014, pp. 53–64. DOI: 10.1145/2594291.2594302
(cit. on p. 81).
[Sch05] Alexander Schrijver. “On the History of Combinatorial Optimization (Till 1960)”. In:
Discrete Optimization. Ed. by K. Aardal, G.L. Nemhauser, and R. Weismantel. Vol. 12.
Handbooks in Operations Research and Management Science. Elsevier, 2005, pp. 1–
68. DOI: 10.1016/S0927-0507(05)12001-5 (cit. on p. 42).
[Sem] How an FPGA Approach to Complex System Design Can Improve Profitability: Real
Case Studies. Tech. rep. Semico Research Corporation. URL: http://www.xilinx.
com/publications/prod_mktg/easypath-7-fpga-asic-approach.pdf
(cit. on p. 28).
[SF08] Scott Sirowy and Alessandro Forin. Where’s the Beef? Why FPGAs Are So Fast. Tech.
rep. MSR-TR-2008-130. Microsoft, Aug. 2008 (cit. on p. 28).
[SS90] Harald Sondergaard and Peter Sestoft. “Referential Transparency, Definiteness and
Unfoldability”. In: Acta Informatica 27.6 (Jan. 1990), pp. 505–517. URL: http:
//dl.acm.org/citation.cfm?id=79245.79247 (cit. on p. 157).
[St16] Richard M Stallman and the GCC Developer Community. Using and porting the GNU
compiler collection—For GCC Version 6.1.0. Free Software Foundation, Inc, 2016. URL:
https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc.pdf (cit. on p. 73).
[Sud+16] Naveen Suda, Vikas Chandra, Ganesh Dasika, et al. “Throughput-Optimized OpenCL-
based FPGA Accelerator for Large-Scale Convolutional Neural Networks”. In: Pro-
ceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. FPGA ’16. Monterey, California, USA: ACM, 2016, pp. 16–25. DOI:
10.1145/2847263.2847276 (cit. on p. 45).
[Tar55] Alfred Tarski. “A Lattice-Theoretical Fixpoint Theorem and Its Applications”. In: Pacific
Journal of Mathematics 5.2 (1955), pp. 285–309. URL: http://projecteuclid.
org/euclid.pjm/1103044538 (cit. on p. 51).
[Tat+] Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. “Equality Saturation: A
New Approach to Optimization”. In: Proceedings of the 36th Annual ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages. POPL ’09. Savannah,
GA, USA, pp. 264–276. DOI: 10.1145/1480881.1480915 (cit. on pp. 26, 68, 69,
71, 72, 78, 83).
[TB06] Sid-Ahmed-Ali Touati and Denis Barthou. “On the Decidability of Phase Order-
ing Problem in Optimizing Compilation”. In: Proceedings of the 3rd Conference
on Computing Frontiers. CF ’06. Ischia, Italy: ACM, 2006, pp. 147–156. DOI: 10.
1145/1128022.1128042. URL: http://doi.acm.org/10.1145/1128022.
1128042 (cit. on p. 68).
[Tho+09] David Barrie Thomas, Lee Howes, and Wayne Luk. “A Comparison of CPUs, GPUs,
FPGAs, and Massively Parallel Processor Arrays for Random Number Generation”.
In: Proceedings of the ACM/SIGDA International Symposium on Field Programmable
Gate Arrays. FPGA ’09. Monterey, California, USA: ACM, 2009, pp. 63–72. DOI:
10.1145/1508128.1508139 (cit. on p. 28).
[TM14] Neil Toronto and Jay McCarthy. “Practically Accurate Floating-Point Math”. In:
Computing in Science Engineering 16.4 (July 2014), pp. 80–95. DOI: 10.1109/MCSE.
2014.90 (cit. on p. 77).
[Tri+05] J. L. Tripp, K. D. Peterson, C. Ahrens, J. D. Poznanovic, and M. B. Gokhale. “Trident: an
FPGA compiler framework for floating-point algorithms”. In: International Conference
on Field Programmable Logic and Applications, 2005. Aug. 2005, pp. 317–322. DOI:
10.1109/FPL.2005.1515741 (cit. on pp. 33, 38).
[Tur37] Alan M. Turing. “On computable numbers, with an application to the Entschei-
dungsproblem”. In: Proceedings of the London Mathematical Society . 1937, pp. 230–
265. DOI: 10.1112/plms/s2-42.1.230 (cit. on p. 46).
[Ver10] Sven Verdoolaege. “isl: An Integer Set Library for the Polyhedral Model”. In: ICMS.
2010 (cit. on p. 165).
[Wan+08] Gang Wang, Wenrui Gong, and Ryan Kastner. “Operation Scheduling: Algorithms
and Applications”. In: High-Level Synthesis. Springer, 2008 (cit. on pp. 166, 168).
[Wan+13] Yuxin Wang, Peng Li, Peng Zhang, Chen Zhang, and Jason Cong. “Memory Parti-
tioning for Multidimensional Arrays in High-Level Synthesis”. In: Design Automation
Conference (DAC), 2013 50th ACM/EDAC/IEEE. May 2013, pp. 1–8 (cit. on p. 45).
198 Bibliography
[Win+13] Felix Winterstein, Samuel Bayliss, and George A. Constantinides. “High-Level Synthe-
sis of Dynamic Data Structures: A Case Study Using Vivado HLS”. In: 2013 Interna-
tional Conference on Field-Programmable Technology (FPT). Dec. 2013, pp. 362–365.
DOI: 10.1109/FPT.2013.6718388 (cit. on p. 44).
[Win+15] Felix J. Winterstein, Samuel R. Bayliss, and George A. Constantinides. “Separation
Logic for High-Level Synthesis”. In: ACM Trans. Reconfigurable Technology and Systems
9.2 (Dec. 2015), 10:1–10:23. DOI: 10.1145/2836169 (cit. on p. 45).
[Wir14] Loring Wirbel. Xilinx SDAccel—A Unified Development Environment for Tomorrow’s
Data Center. Tech. rep. Nov. 2014. URL: http://www.xilinx.com/publications/
prod_mktg/sdx/sdaccel-wp.pdf (cit. on p. 33).
[WM94] Wm. A. Wulf and Sally A. McKee. Hitting the Memory Wall: Implications of the Obvious.
Tech. rep. 1994 (cit. on p. 27).
[Xil12] Xilinx, Inc. Vivado Design Suite User Guide—High-Level Synthesis. 2012 (cit. on pp. 18,
44, 74, 166, 170).
[Xil13] Xilinx, Inc. XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices. Mar. 2013.
URL: http://www.xilinx.com/support/documentation/sw_manuals/
xilinx14_5/xst_v6s6.pdf (cit. on p. 105).
[Xil15] Xilinx, Inc. Vivado Design Suite User Guide—Synthesis. 2015 (cit. on pp. 33, 170).
[Zha+15] Chen Zhang, Peng Li, Guangyu Sun, et al. “Optimizing FPGA-based Accelerator Design
for Deep Convolutional Neural Networks”. In: Proceedings of the 2015 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. FPGA ’15. Monterey,
California, USA: ACM, 2015, pp. 161–170. DOI: 10.1145/2684746.2689060
(cit. on p. 44).
[ZL13] Zhiru Zhang and Bin Liu. “SDC-based Modulo Scheduling for Pipeline Synthesis”. In:
2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). Nov.
2013, pp. 211–218. DOI: 10.1109/ICCAD.2013.6691121 (cit. on pp. 37–39, 42,
43).
[ZS03] Jihan Zhu and Peter Sutton. “FPGA Implementations of Neural Networks – A Survey
of a Decade of Progress”. In: 13th International Conference on Field Programmable
Logic and Application. Ed. by Peter Y. K. Cheung and George A. Constantinides.
Springer Berlin Heidelberg, 2003, pp. 1062–1066. DOI: 10.1007/978-3-540-
45234-8_120 (cit. on p. 185).
Bibliography 199

Appendix A Sound Acceleration of Equivalent Expression Discovery
The functions used in Section 3.3.2 can be slow to compute with a naïve
implementation. By using abstract interpretation and the properties of certain equivalent
expression generator (EEG) functions, we can accelerate the computation. We start
by defining a property of the EEG functions used in Section 3.3.2, known as ∪-distributivity,
and then propose a new algorithm to accelerate the computation of $\mathrm{cl}_N f(\epsilon)$, where $f$ is a
∪-distributive EEG. Here, an EEG $f$ is ∪-distributive if $f(\epsilon_1 \cup \epsilon_2) = f(\epsilon_1) \cup f(\epsilon_2)$
for all sets of expressions $\epsilon_1$ and $\epsilon_2$.
Corollary A.1. By the definition of $I$ in (3.22), it is clear that $I$ is ∪-distributive.
We then prove that the algorithm CLOSURE($f$, $N$, $\epsilon$) in Figure 3.1 indeed
computes $\mathrm{cl}_N f(\epsilon)$. Firstly, we prove the following lemma:
Lemma A.1. $\mathrm{cl}_N f(\epsilon) = \epsilon \cup f(\mathrm{cl}_{N-1} f(\epsilon))$ for any ∪-distributive EEG function $f$.
Proof. Following (3.24), $\mathrm{cl}_N f(\epsilon) = f^0(\epsilon) \cup f^1(\epsilon) \cup \cdots \cup f^N(\epsilon)$. Because $f$ is ∪-distributive,
we apply distributivity to the right-hand side to derive:
$$\mathrm{cl}_N f(\epsilon) = \epsilon \cup f\left(f^0(\epsilon) \cup f^1(\epsilon) \cup \cdots \cup f^{N-1}(\epsilon)\right), \tag{A.1}$$
which equals $\epsilon \cup f(\mathrm{cl}_{N-1} f(\epsilon))$ by definition.
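The identity in Lemma A.1 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: expressions are modelled as short strings, and the toy EEG `f` (a hypothetical stand-in, not one of the EEGs from Section 3.3.2) rewrites each string by swapping one adjacent pair of characters. Because `f` is lifted element-wise from a per-expression rewrite, it is ∪-distributive by construction.

```python
def swaps(expr):
    """All strings obtained from expr by swapping one adjacent pair (toy rewrite rule)."""
    return {expr[:i] + expr[i + 1] + expr[i] + expr[i + 2:]
            for i in range(len(expr) - 1)}


def f(exprs):
    """Toy EEG: union of the rewrites of each element, hence union-distributive."""
    out = set()
    for e in exprs:
        out |= swaps(e)
    return out


def closure_naive(f, n, exprs):
    """cl_n f(exprs) computed literally as f^0(exprs) | f^1(exprs) | ... | f^n(exprs)."""
    result = set(exprs)
    power = set(exprs)
    for _ in range(n):
        power = f(power)   # f^i(exprs)
        result |= power
    return result


def lemma_holds(f, n, exprs):
    """Check cl_n f(exprs) == exprs | f(cl_{n-1} f(exprs)) for n >= 1."""
    lhs = closure_naive(f, n, exprs)
    rhs = set(exprs) | f(closure_naive(f, n - 1, exprs))
    return lhs == rhs


eps = {"abc"}
print(all(lemma_holds(f, n, eps) for n in range(1, 6)))
```

With this toy rule, three rounds of rewriting already reach every ordering of the three characters.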
This allows us to deduce that the algorithm indeed computes $\mathrm{cl}_N f(\epsilon)$:
Theorem A.1. In the algorithm in Figure 3.1, at iteration $n$, the set of equivalent expressions
$s_n$ is exactly $\mathrm{cl}_n f(\epsilon)$, if $f$ is a ∪-distributive EEG.
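Proof. We start by assuming that at iteration $m > 0$, $s_m = \mathrm{cl}_m f(\epsilon)$, and we prove that this
equality still holds if we substitute $m$ with $m + 1$. From the algorithm, we can deduce:
$$s_{m+1} = s_m \cup s'_{m+1} = s_m \cup (f(s'_m) - s_m) = s_m \cup f(s'_m) = s_m \cup f(f(s'_{m-1}) - s_{m-1}).$$
We substitute $s_m$ using Lemma A.1 to get:
$$s_{m+1} = \epsilon \cup f(s_{m-1}) \cup f(f(s'_{m-1}) - s_{m-1}).$$
Using distributivity of $f$ over ∪ and iteration $m$ of the algorithm, we can derive:
$$s_{m+1} = \epsilon \cup f(s_{m-1} \cup (f(s'_{m-1}) - s_{m-1})) = \epsilon \cup f(s_m).$$
Finally, we make use of the assumption $s_m = \mathrm{cl}_m f(\epsilon)$, followed by Lemma A.1, to show:
$$s_{m+1} = \epsilon \cup f(\mathrm{cl}_m f(\epsilon)) = \mathrm{cl}_{m+1} f(\epsilon).$$
It is trivial that $s_0 = \epsilon = \mathrm{cl}_0 f(\epsilon)$; by induction, $s_n = \mathrm{cl}_n f(\epsilon)$ thus holds for all $n \in \mathbb{N}$.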
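The statement of Theorem A.1 can be exercised with a small sketch. Figure 3.1 is not reproduced in this appendix, so the update rules below are an assumption read off from the proof: the frontier is updated as s'_i = f(s'_{i-1}) - s_{i-1} and the accumulated set as s_i = s_{i-1} ∪ s'_i. The toy EEG `f` and the names `closure` and `closure_naive` are hypothetical and serve only as an illustration.

```python
def swaps(expr):
    """All strings obtained from expr by swapping one adjacent pair (toy rewrite rule)."""
    return {expr[:i] + expr[i + 1] + expr[i] + expr[i + 2:]
            for i in range(len(expr) - 1)}


def f(exprs):
    """Toy union-distributive EEG, lifted element-wise from swaps."""
    out = set()
    for e in exprs:
        out |= swaps(e)
    return out


def closure_naive(f, n, exprs):
    """Reference: union of f^0(exprs), ..., f^n(exprs), applying f to whole sets."""
    result = set(exprs)
    power = set(exprs)
    for _ in range(n):
        power = f(power)
        result |= power
    return result


def closure(f, n, exprs):
    """Frontier-based closure: f is only applied to newly discovered expressions."""
    s = set(exprs)         # s_0
    frontier = set(exprs)  # s'_0
    for _ in range(n):
        frontier = f(frontier) - s  # s'_i = f(s'_{i-1}) - s_{i-1}
        s = s | frontier            # s_i  = s_{i-1} | s'_i
    return s


eps = {"abcd"}
print(all(closure(f, n, eps) == closure_naive(f, n, eps) for n in range(8)))
```

The agreement between the two functions relies on `f` being ∪-distributive; for a function that is not, the frontier-based version may drop results.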
Unfortunately, in the alternative method greedy_trace, we cannot make use of the
efficient algorithm in Figure 3.1 to compute $\mathrm{cl}_N (f_r \circ I_k)$ directly. The reason is that, in
general, $f_r \circ I_k$ is not ∪-distributive. Consider two equivalent expressions $e_1$ and $e_2$, where
$e_1$ strictly dominates $e_2$, i.e. $e_1$ is better than $e_2$ in terms of both accuracy and resources. Then
we can observe that:
$$f_r(\{e_1\} \cup \{e_2\}) = f_r(\{e_1, e_2\}) = \{e_1\}, \tag{A.2}$$
which is not equal to:
$$f_r(\{e_1\}) \cup f_r(\{e_2\}) = \{e_1\} \cup \{e_2\}. \tag{A.3}$$
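The counterexample in (A.2) and (A.3) can be reproduced with a minimal Pareto-frontier filter. This is a sketch under assumptions: each candidate carries two made-up quality metrics (`error` and `area`, lower is better), and `frontier` is a hypothetical stand-in for the frontier function written f_r in the text, not the thesis implementation.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """An equivalent expression with illustrative quality metrics (lower is better)."""
    name: str
    error: float
    area: int


def dominates(a, b):
    """True if a is no worse than b in both metrics and strictly better in one."""
    no_worse = a.error <= b.error and a.area <= b.area
    better = a.error < b.error or a.area < b.area
    return no_worse and better


def frontier(cands):
    """Keep only the candidates that no other candidate in the set dominates."""
    cands = set(cands)
    return {c for c in cands if not any(dominates(o, c) for o in cands)}


e1 = Candidate("e1", error=1e-7, area=10)
e2 = Candidate("e2", error=2e-7, area=12)  # strictly dominated by e1

joint = frontier({e1} | {e2})             # corresponds to (A.2)
separate = frontier({e1}) | frontier({e2})  # corresponds to (A.3)
print(joint == separate)
```

Filtering the union discards `e2`, whereas filtering each singleton keeps both candidates, so the filter does not distribute over ∪.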
To resolve this, following the abstract interpretation framework discussed in Section 2.3 of Chapter 2,
we can introduce a new abstract domain for the power set of equivalent expressions. This
new domain reduces the computation effort of equivalent expression discovery, because
it is a simpler domain with fewer elements than the power set.




where its abstraction and concretization functions are defined as follows:
$$\alpha(\epsilon) = f_r(\epsilon), \qquad \gamma(\epsilon^\sharp) = \epsilon^\sharp. \tag{A.5}$$
Specifically, $\alpha$ is identical to $f_r$, and $\gamma$ is simply the identity function, i.e. it returns
its input unchanged.
The partial ordering on them ($\sqsubseteq$) and the join operator ($\sqcup$) can then be defined inductively
as follows using the Galois connection:

















Additionally, an abstract variant of an arbitrary EEG f can also be inductively defined:











Finally, the abstract variant of $\mathrm{cl}_N f$ can be obtained with a simple modification to the
algorithm in Figure 3.1. By replacing ∪ with $\sqcup$ in $s_i \leftarrow s_{i-1} \cup s'_i$, we now have an abstract
closure function:





that operates with an abstract EEG $f^\sharp$. The correctness of this formulation can be proved



















k = fr ◦Ik
can finally be accelerated using the modified algorithm.
Appendix B Formal Definitions of Equivalent MIR Discovery
Section 4.5 of Chapter 4 informally explained how equivalent semantic expressions
and MIRs can be discovered. In this appendix we formally define the equivalent discovery
procedure for the operators newly introduced in Section 4.2, i.e. the ternary conditional,
composition, and fixpoint operators, and extend this definition to MIRs in a similar
fashion.
Formally, we extend the optimization function $O[\cdot] : M \to \Sigma_E^\sharp \to M$, where $M =
\mathrm{SemExpr} \cup \mathrm{MIR}$, proposed in Section 3.3 of Chapter 3, to the above structures. For








{ ? b′ e′1 e′2 | b′ ∈ O[b]σ♯, e′1 ∈ O[e1]σ♯|′b, e′2 ∈ O[e2]σ♯|′b }, (B.1)
where the function $f_{\sigma^\sharp}$ is defined in Section 3.3.4 of Chapter 3.





















computes the Pareto-optimal set of equivalent semantic expressions from
the initial set $\epsilon$, using the program state $\sigma^\sharp$ to evaluate the quality metrics (i.e. round-off
errors and resource utilization).
































where K is the partial unroll factor limit. We set K = 3 in the experimental results
discussed in Section 4.7.
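As a reminder of what a partial unroll factor means in general, the sketch below unrolls a loop shaped like the `simple` benchmark of Appendix C by factors 2 and 3, guarding each extra copy of the body with the loop condition. It shows only the generic transformation, under the assumption that unrolling preserves the sequence of operations; it is not the formal discovery rule of this appendix, and the function names are hypothetical. Python floats are double precision, unlike the single-precision `float` in the benchmark.

```python
def simple(x):
    """Original loop: scale x by 0.9 until it no longer exceeds 1.0."""
    while x > 1.0:
        x = 0.9 * x
    return x


def simple_unrolled_2(x):
    """Loop body unrolled by a factor of 2; the second copy is guarded."""
    while x > 1.0:
        x = 0.9 * x
        if x > 1.0:
            x = 0.9 * x
    return x


def simple_unrolled_3(x):
    """Loop body unrolled by a factor of 3; each extra copy is guarded."""
    while x > 1.0:
        x = 0.9 * x
        if x > 1.0:
            x = 0.9 * x
            if x > 1.0:
                x = 0.9 * x
    return x


inputs = [0.0, 0.5, 1.0, 1.5, 7.25, 20.0]
print(all(simple(x) == simple_unrolled_2(x) == simple_unrolled_3(x) for x in inputs))
```

All three variants perform the same multiplications in the same order, so their results agree exactly; a larger factor limit enlarges the space of candidate structures to explore.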
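Finally, MIRs can also be optimized in a similar fashion, by recursively discovering
equivalent expressions within them:
$$O[\mu]\,\sigma^\sharp = f_r\left(\left\{ [x \mapsto e_x]_{x \in \mathrm{var}(\mu)} \;\middle|\; e_y \in O[\mu(y)]\,\sigma^\sharp,\ y \in \mathrm{var}(\mu) \right\}, \sigma^\sharp\right). \tag{B.4}$$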
Appendix C Benchmark Source Code
In Chapters 4 and 5 we explore the experimental results of several numerical programs
as our benchmark examples. This appendix contains the source code of the benchmark
suite used.
#pragma soap in float x=[0, 20]
#pragma soap out x
while (x > 1.0) {
    x = 0.9f * x;
}
Figure C.1. simple
#pragma soap in \
int n=[10, 20], float x=[-0.1, 0.1], float y=[0, 1]
#pragma soap out z
float a = 1;
int b = 1;
float p = 1;
float z = 0.0f;
for (int i = 0; i < n; i++) {
a = -a;
b *= (2 * i + 1) * (2 * i);
p *= (x + y) * (x + y);




#pragma soap in \
float a0=[0, 0.2], float a1=[0, 0.2], \
float a2=[0.0, 0.2], float b0=[0, 0.2], \
float b1=[0, 0.2], float b2=[0.0, 0.2], \
float x=[0, 1], int n=20
#pragma soap out y
float x1 = 0.0f, x2 = 0.0f;
float y1 = 0.0f, y2 = 0.0f;
float y = x;
for (int i = 0; i < n; i++) {
float yt = y;
y = b0 * x + b1 * x1 + b2 * x2 +







#pragma soap in \
float u=[0.0, 1.0], float w=[0.0, 1.0], \
int n=[0, 20], float dt=[0.1, 0.1]
#pragma soap out u, v
float u;
float v = 0.0f;
for (int i = 0; i < n; i++) {
float u0 = u + v * dt;




#define n 20
#pragma soap in \
float kp=[9, 10], float ki=[0.5, 0.7], float kd=[0, 3], \
float dt=[0.2, 0.2], float m=8.0, float c=5.0
#pragma soap out m
float i = 0.0f, e0 = 0.0f;
float m, e, d, r;
for (int j = 0; j < n; j++) {
e = c - m;
i += ki * dt * e;
d = kd * (e - e0) / dt;
r = kp * e + i + d;
e0 = e;




#pragma soap in float x[N] = [0.0, 1.0]
#pragma soap out sum
float sum = 0;
for (int i = 0; i < N; i = i + 1) {




#pragma soap in \
float x[100] = [0.0, 1.0], float z[100] = [0.0, 1.0]
#pragma soap out q
float q = 0.0f;
for (int k = 0; k < N; k++) {






#pragma soap in \
float x[n] = [0.0, 1.0], float y[n] = [0.0, 1.0], \
float z[n] = [0.0, 1.0]
#pragma soap out x
int l; int i;
for ( l=1 ; l<=loop ; l++ ) {
for ( i=1 ; i<n ; i++ ) {




// D := alpha*A*B*C + beta*D
#define N 1024
#pragma soap in \
    float A[N][N] = [0.0, 1.0], float B[N][N] = [0.0, 1.0], \
    float C[N][N] = [0.0, 1.0], float D[N][N] = [0.0, 1.0], \
    float tmp[N][N] = [0.0, 1.0]
#pragma soap out D
int i; int j; int k;
float alpha = 32412;
float beta = 2123;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        tmp[i][j] = 0;
        for (k = 0; k < N; ++k)
            tmp[i][j] += alpha * A[i][k] * B[k][j];
    }
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        D[i][j] *= beta;
        for (k = 0; k < N; ++k)
            D[i][j] += tmp[i][k] * C[k][j];
    }
Figure C.9. 2mm
// G = (A * B) * (C * D)
#define N 1024
#pragma soap in \
float A[N][N] = [0.0, 1.0], float B[N][N] = [0.0, 1.0], \
float C[N][N] = [0.0, 1.0], float D[N][N] = [0.0, 1.0], \
float E[N][N] = [0.0, 1.0], float F[N][N] = [0.0, 1.0], \
float G[N][N] = [0.0, 1.0]
#pragma soap out G
int i; int j; int k;
for (i = 0; i < N; i++) /* E := A*B */
for (j = 0; j < N; j++) {
E[i][j] = 0;
for (k = 0; k < N; ++k)
E[i][j] += A[i][k] * B[k][j];
}
for (i = 0; i < N; i++) /* F := C*D */
for (j = 0; j < N; j++) {
F[i][j] = 0;
for (k = 0; k < N; ++k)
F[i][j] += C[i][k] * D[k][j];
}
for (i = 0; i < N; i++) /* G := E*F */
for (j = 0; j < N; j++) {
G[i][j] = 0;
for (k = 0; k < N; ++k)




#pragma soap in \
float A[N][N] = [0.0, 1.0], float x[N] = [0.0, 1.0], \
float y[N] = [0.0, 1.0], float tmp[N] = 0
#pragma soap out y
int i; int j;
for (i = 0; i < N; i++) {
tmp[i] = 0;
for (j = 0; j < N; j++)
tmp[i] = tmp[i] + A[i][j] * x[j];
for (j = 0; j < N; j++)





#pragma soap in \
float A[N][N] = [0.0, 1.0], float s[N] = [0.0, 1.0], \
float q[N] = 0, \
float p[N] = [0.0, 1.0], float r[N] = [0.0, 1.0]
#pragma soap out q
int i; int j;
for (i = 0; i < N; i++)
s[i] = 0;
for (i = 0; i < N; i++)
{
q[i] = 0;
for (j = 0; j < N; j++)
{
s[j] = s[j] + r[i] * A[i][j];




// C := alpha*A*B + beta*C
#define N 1024
#pragma soap in \
    float C[N][N] = [0.0, 1.0], float A[N][N] = [0.0, 1.0], \
    float B[N][N] = [0.0, 1.0]
#pragma soap out C
int i; int j; int k;
float alpha = 32412;
float beta = 2123;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        C[i][j] *= beta;
        for (k = 0; k < N; ++k)
            C[i][j] += alpha * A[i][k] * B[k][j];
    }
Figure C.13. gemm
// Vector Multiplication and Matrix Addition
#define N 1024
#pragma soap in \
    float A[N][N] = [0.0, 1.0], \
    float u1[N] = [0.0, 1.0], float v1[N] = [0.0, 1.0], \
    float u2[N] = [0.0, 1.0], float v2[N] = [0.0, 1.0], \
    float w[N] = 0, float x[N] = 0, \
    float y[N] = [0.0, 1.0], float z[N] = [0.0, 1.0]
#pragma soap out w
int i; int j;
float alpha = 43532;
float beta = 12313;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        x[i] = x[i] + beta * A[j][i] * y[j];
for (i = 0; i < N; i++)
    x[i] = x[i] + z[i];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        w[i] = w[i] + alpha * A[i][j] * x[j];
Figure C.14. gemver
// Matrix Vector Product and Transpose
#define N 1024
#pragma soap in \
float x1[N] = [0.0, 1.0], float x2[N] = [0.0, 1.0], \
float y_1[N] = [0.0, 1.0], float y_2[N] = [0.0, 1.0], \
float A[N][N] = [0.0, 1.0]
#pragma soap out x1, x2
int i; int j;
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
x1[i] = x1[i] + A[i][j] * y_1[j];
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)





#pragma soap in float A[N][N] = [0.0, 1.0]
#pragma soap out A
int t; int i; int j;
for (t = 0; t < TSTEPS; t++)
    for (i = 1; i < N - 1; i++)
        for (j = 1; j < N - 1; j++)
            A[i][j] = (A[i-1][j]
                + A[i][j-1] + A[i][j] + A[i][j+1] + A[i+1][j]) * 0.2f;
Figure C.16. seidel
// Symmetric rank-2k operations
#define N 1024
#pragma soap in \
    float alpha = [0.0, 1.0], float beta = [0.0, 1.0], \
    float A[N][N] = [0.0, 1.0], float B[N][N] = [0.0, 1.0], \
    float C[N][N] = [0.0, 1.0]
#pragma soap out C
int i; int j; int k;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++) {
            C[i][j] += alpha * A[i][k] * B[j][k];
            C[i][j] += alpha * B[i][k] * A[j][k];
        }
Figure C.17. syr2k
