Data Representation Optimisation for
Reconfigurable Hardware Design
William George Osborne
A thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computing
Imperial College London
180 Queen’s Gate, London, SW7 2BZ

STATEMENT OF ORIGINALITY
The research outlined in this thesis was conducted in the Department of Computing
at Imperial College London. I declare that the work presented is my own, except
where acknowledged.
W G Osborne
Imperial College London
October 2011

ABSTRACT
One of the challenges of designing hardware circuits is representing the data in an
efficient way — minimising area and power while maximising clock frequency. There
are several ways of representing variables, each with different characteristics, such as
the effect arithmetic operations have on the absolute and relative error. In the first
part of this thesis, a new method of transforming arithmetic by combining different
numerical representations to exploit their advantages is discussed. The problem is
formulated as a set of linear equations which are then solved to find the optimal
solution. Algorithms that generate sub-optimal solutions are investigated because
they take a fraction of the time to run. A new reconfigurable device structure is
proposed based on the results presented. In this case, the accuracy of the original
application is guaranteed to be met regardless of the input data.
In many applications, guaranteeing that a transformed design has at least the
same accuracy as the original is not a strong enough constraint. For this reason, the
error on the output is guaranteed to be lower than a specified value. In the second
part of this thesis, accuracy reduction is investigated with the goal of minimising
circuit area. Energy-efficient run-time reconfigurable hardware is automatically
created by systematically deactivating parts of the circuit based on the accuracy
required. A model is presented that determines the conditions under which reconfiguring the
chip, where this is possible, is more energy-efficient than multiplexing. The approach is
expanded to general purpose processors; a new computational model — both software
and hardware architecture — to reduce the energy of future devices is introduced.

ACKNOWLEDGEMENTS
I acknowledge my advisers, Prof Wayne Luk and Dr Oskar Mencer, for their support
while developing the ideas presented in this thesis. Special thanks to Dr Gabriel
Coutinho for the many discussions and insights which helped to improve the quality
of this research. The advice given by the Custom Computing group over the years
has proved invaluable.
I would also like to thank Dr Tim Todman for providing source code for a ray
tracer, one of the applications of the work presented, and Dr Kubilay Atasu for
guidance with integer linear programming.
I would like to express my gratitude to Dr Robert Clapp for his direction while
working at Stanford University on seismic imaging, an important application of this
work.
The support of FP6 HARTES (Holistic Approach to Reconfigurable Real Time
Embedded Systems) and UK EPSRC who funded the work in this thesis is gratefully
acknowledged as is the support of Xilinx, Altera, Celoxica and Agility who provided
the sophisticated tools required for cutting-edge research.

TABLE OF CONTENTS
1 Introduction
1.1 Contributions
1.2 Thesis Structure
1.3 Selected Publications
1.3.1 Journal Publications
1.3.2 Conference Publications
2 Background
2.1 Field-Programmable Gate Arrays
2.2 Arithmetic Analysis
2.2.1 Number Systems
2.2.2 Range Analysis
2.2.3 Precision Analysis
2.2.4 Bitwise
2.2.5 Fixed-Point Design and Simulation Environment
2.2.6 Automatic Differentiation
2.2.7 MATCH
2.2.8 Error Heuristic
2.2.9 Cost Heuristic
2.2.10 Guaranteeing Accuracy
2.2.11 Architecture Mapping
2.2.12 Right–Size
2.2.13 Application to Processors
2.3 Phase Analysis
2.3.1 Word-Length Adaptation
2.3.2 Phase Characterisation
2.4 High-Level Hardware Design
2.4.1 Source Code Analysis and Transformation
2.4.2 Arithmetic Design
2.5 Modelling Power
2.6 Summary
3 Reducing Circuit Area using Multiple Data Representations
3.1 Problem Definition
3.2 Methodology
3.2.1 Accuracy
3.2.2 Floating-Point Hardware
3.2.3 Embedded Hardware
3.2.4 Number Systems
3.3 Optimisation
3.3.1 Integer Linear Programming
3.3.2 Simulated Annealing
3.4 Results
3.4.1 Convolution
3.4.2 Financial Modelling
3.4.3 Image Processing and Ray Tracing
3.4.4 Additional Case Studies
3.4.5 Analysis
3.5 Proposed Device Architecture
3.6 Summary
4 Scalable Accuracy-Guaranteed Word-Length Optimisation
4.1 Range and Precision Reduction
4.1.1 Low-Effort Pass
4.1.2 Complexity
4.1.3 High-Effort Pass
4.1.4 Application to Different Number Systems
4.2 Run-Time Analysis
4.2.1 Black-Box Functions
4.2.2 Control Flow
4.3 Results
4.3.1 Ray Tracing
4.3.2 Convolution
4.3.3 Matrix-Vector Multiplication
4.3.4 Analysis
4.4 Summary
5 Energy Reduction by Systematic Run-Time Hardware Deactivation
5.1 Methodology
5.1.1 Word-Length Selection
5.1.2 Reconfiguration with Multiplexers
5.1.3 Reducing Power Consumption in FPGAs
5.1.4 Combining Reconfiguration Approaches
5.2 Reconfiguration Conditions
5.2.1 Performance
5.2.2 Reconfiguration Interval
5.3 Reconfiguration Strategy
5.3.1 Multiplexer Reconfiguration
5.3.2 Bitstream Reconfiguration
5.3.3 Comparing Reconfiguration Strategies
5.4 Results
5.4.1 Inner Product and Vector Multiplication
5.4.2 Uniform Cubic B–Splines
5.4.3 Ray Tracing
5.4.4 Analysis
5.5 Proposed Model of Computation
5.6 Summary
6 Summary and Conclusions
6.1 Combining Multiple Data Representations
6.2 Scalable Word-Length Optimisation
6.3 Systematic Run-Time Hardware Deactivation
6.4 Future Work
6.4.1 Combining Multiple Data Representations
6.4.2 Scalable Word-Length Optimisation
6.4.3 Systematic Run-Time Hardware Deactivation
A Appendix: Reducing Circuit Area using Multiple Data Representations
A.1.1 Ray Tracer Architectural Description
A.1.2 Cost Metrics
B Appendix: Scalable Accuracy-Guaranteed Word-Length Optimisation
B.1.1 Source Code Annotations
B.1.2 Heuristic Optimisation
Bibliography

LIST OF FIGURES
2.1 A simplified diagram of an FPGA logic block
2.2 CAST objects
3.1 An outline of the methodology to optimise data representation
3.2 Relative and absolute accuracy of floating-point and fixed-point
3.3 Floating-point accuracy compared with fixed-point accuracy
3.4 Simplified structure of a floating-point adder and multiplier
3.5 B–splines area (LUTs) with multiple number systems on different FPGAs
3.6 Simplified diagram of the DSP48 on Virtex 4 FPGAs
3.7 Effect of constraints on ILP run time
3.8 An example illustrating data representation constraints
3.9 Generalisation of the data representation problem
3.10 Area (LUTs) and algorithm run time for a convolution with multiple number systems
3.11 Area (flip-flops) of a convolution with multiple number systems
3.12 Area (LUTs) of the generalised autoregressive conditional heteroskedasticity (GARCH) financial model with multiple number systems
3.13 Area (LUTs) of the ray tracer with multiple number systems
3.14 Combining word-length and data representation optimisation
3.15 Proposed architecture to reduce circuit area
3.16 Area of a convolution circuit with the proposed architecture
4.1 Example data-flow graph to illustrate the precision analysis problem
4.2 Area and algorithm run time at varying partition sizes for the convolution benchmark
4.3 A summary of the word-length reduction algorithm
4.4 Area and algorithm run time for the B–splines benchmark at varying levels of precision
4.5 Area and algorithm run time for the B–splines benchmark with variable output precision
4.6 Function approximation for square root and logarithm
4.7 An example showing how conditional statements affect error
4.8 An example showing how control-flow analysis can reduce energy consumption
4.9 Area and algorithm run time for the ray tracer benchmark with variable output precision
4.10 Area and algorithm run time for the convolution benchmark at varying levels of precision
4.11 Area and algorithm run time for the convolution benchmark at varying levels of precision using a partitioned data-flow graph
4.12 Algorithm run time for the convolution benchmark with twice the number of variables at varying levels of precision
5.1 A model of a reconfigurable circuit
5.2 Clock gating in an ASIC and FPGA
5.3 Area and power consumption of the B–splines benchmark
5.4 Run time and energy against reconfiguration interval
5.5 Power saving for a 64-bit multiplier by reducing the precision
5.6 Average run time, above which multiple bitstream reconfiguration becomes more efficient than multiplexing
5.7 Area against word-length for a constant inner product and constant vector multiplier
5.8 Power consumption against word-length for a constant inner product and constant vector multiplier
5.9 Area and power consumption of the ray tracer with varying output accuracy
5.10 Two circuits to reduce switching activity
5.11 Example source code showing how to reduce energy
5.12 Example source code to further reduce energy
A.1 Ray tracer architectural description
B.1 Source code annotations for a ray-sphere intersection
B.2 Area and algorithm run time for the Gaussian blur benchmark at varying levels of precision
B.3 Area and algorithm run time for the Gaussian blur benchmark at varying levels of precision using a partitioned data-flow graph
B.4 Area and algorithm run time for the RGB to YCbCr conversion benchmark at varying levels of precision
CHAPTER 1
Introduction
There are three key challenges when solving computationally demanding problems:
increasing performance, reducing energy and reducing development time. One method
of increasing performance and reducing energy is to change the way variables are
stored and manipulated. Every variable in an algorithm is represented in a given
number system, each of which has advantages and disadvantages. There are several
common number systems:
Floating-point is commonly used on general purpose processors and dedicated
graphics hardware; however, high performance has also been obtained on reconfigurable
devices such as field-programmable gate arrays (FPGAs) [124]. The most common
representation is the IEEE 754 standard, which defines single and double precision
formats. A value is represented as:
$$(-1)^{\mathit{Sign}} \times 1.\mathit{Mantissa} \times 2^{\mathit{Exponent} - \mathit{Bias}}$$
where Sign is 1 bit wide, Exponent is 8 bits wide and Mantissa is 23 bits wide
in single precision. A bias (127 for single precision) is subtracted from the stored
exponent so that negative exponents can be represented while still allowing
efficient comparison. Flexible floating-point formats have been proposed [41, 62] along
with parameterised architectures [70] allowing trade-offs between area, latency (the
time between data being input and a result being generated) and throughput [90].
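The single precision layout described above can be illustrated with a minimal Python sketch. The function names decode_single and field_value are hypothetical, and only normalised numbers are handled (zero, subnormals, infinities and NaNs are ignored).

```python
import struct

def decode_single(x: float):
    """Round x to IEEE 754 single precision and split it into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with the bias added
    mantissa = bits & 0x7FFFFF        # 23 bits, the implicit leading 1 is not stored
    return sign, exponent, mantissa

def field_value(sign: int, exponent: int, mantissa: int, bias: int = 127) -> float:
    """Evaluate (-1)^Sign * 1.Mantissa * 2^(Exponent - Bias) for a normalised number."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - bias)

sign, exponent, mantissa = decode_single(-6.5)
print(sign, exponent, mantissa)               # fields of -6.5 = -1.625 * 2^2
print(field_value(sign, exponent, mantissa))  # reconstructs the original value
```

Because the exponent is stored with the bias already added, it is an unsigned field, which is what allows two positive values to be compared as plain integers.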
Fixed-point is frequently used in FPGAs because it results in smaller units with
lower latency than those employing a floating-point representation assuming that a
high sensitivity around zero is not required. It has the following representation:
Range · Precision
where Range corresponds to the integer part and Precision corresponds to the
fractional part of a variable.
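As a rough illustration of the Range and Precision fields, the following sketch quantises a value to fixed-point. It assumes a signed two's complement format in which the Range field includes the sign bit; the helper names to_fixed and from_fixed are hypothetical.

```python
def to_fixed(value: float, range_bits: int, precision_bits: int) -> int:
    """Quantise value to a signed fixed-point integer (assumed two's complement)."""
    scaled = round(value * (1 << precision_bits))
    total_bits = range_bits + precision_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if not lowest <= scaled <= highest:
        raise OverflowError(f"{value} does not fit in {range_bits} range bits")
    return scaled

def from_fixed(raw: int, precision_bits: int) -> float:
    """Convert the stored integer back to the real value it represents."""
    return raw / (1 << precision_bits)

raw = to_fixed(3.14159, range_bits=4, precision_bits=8)
print(raw, from_fixed(raw, 8))   # quantised to a multiple of 2^-8

# A small value is lost entirely, showing the low sensitivity around zero.
print(to_fixed(0.001, range_bits=4, precision_bits=8))
```

The absolute error is bounded by half of the least significant bit regardless of magnitude, so values much smaller than 2^-Precision lose all of their significant bits.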
Dual fixed-point [44] is similar to fixed-point but can be scaled by two different
values, increasing the range of numbers that can be accurately represented. The
binary point may be in one of two positions determined by the value of the exponent
bit.
Block floating-point is a variant of floating-point in which variables are grouped
together and share a single exponent. Sharing the exponent saves exponent bits,
potentially increasing bandwidth; however, when the values in a group would otherwise
have different exponents, mantissa bits are wasted.
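The trade-off in block floating-point can be seen in a small sketch. This is one possible encoding, assumed for illustration only: the shared exponent is taken from the largest magnitude in a non-empty block and each value keeps a signed integer mantissa. The names block_encode and block_decode are hypothetical.

```python
import math

def block_encode(values, mantissa_bits):
    """Encode a non-empty block with one shared exponent and signed integer mantissas."""
    largest = max(abs(v) for v in values)
    # Shared exponent chosen so that the largest magnitude fits in the mantissa.
    exponent = math.frexp(largest)[1] if largest else 0
    scale = 2.0 ** (mantissa_bits - 1 - exponent)
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = [max(-limit, min(limit, round(v * scale))) for v in values]
    return exponent, mantissas

def block_decode(exponent, mantissas, mantissa_bits):
    """Recover the approximate values from the shared exponent and mantissas."""
    scale = 2.0 ** (mantissa_bits - 1 - exponent)
    return [m / scale for m in mantissas]

exponent, mantissas = block_encode([100.0, 1.25, 0.3], mantissa_bits=8)
print(exponent, mantissas)
print(block_decode(exponent, mantissas, 8))
```

In this example the largest value dictates the exponent, so the two smaller values keep few or no significant mantissa bits.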
The logarithmic number system is often used for applications in which a large
number of multiplication operations are clustered together. Instead of storing the
value directly, its logarithm is stored, resulting in more efficient multiplication than
fixed-point and floating-point [61] but less efficient addition.
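A minimal sketch of why multiplication is cheap and addition is expensive in a logarithmic number system follows. It stores a sign flag and a base-2 logarithm as a Python float; this encoding and the function names are assumptions made for illustration, and a hardware implementation would store the logarithm in fixed-point.

```python
import math

def to_lns(x: float):
    """Represent a non-zero value as (is_negative, log2 of its magnitude)."""
    if x == 0:
        raise ValueError("zero has no logarithm and needs separate handling")
    return (x < 0, math.log2(abs(x)))

def from_lns(value) -> float:
    negative, log_magnitude = value
    magnitude = 2.0 ** log_magnitude
    return -magnitude if negative else magnitude

def lns_mul(a, b):
    """Multiplication only needs an addition of the stored logarithms."""
    return (a[0] != b[0], a[1] + b[1])

def lns_add_same_sign(a, b):
    """Addition needs the non-linear function log2(1 + 2^d), here for equal signs only."""
    if a[0] != b[0]:
        raise ValueError("this sketch only adds values with the same sign")
    large, small = max(a[1], b[1]), min(a[1], b[1])
    return (a[0], large + math.log2(1.0 + 2.0 ** (small - large)))

print(from_lns(lns_mul(to_lns(8.0), to_lns(-0.5))))
print(from_lns(lns_add_same_sign(to_lns(8.0), to_lns(8.0))))
```

The extra function evaluation inside the addition is what makes adders costly in this number system, while a multiplier reduces to an adder and a sign comparison.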
Field-programmable gate arrays (FPGAs) and application-specific integrated
circuits (ASICs) are not restricted to a single representation. Given that extra
hardware may be required to convert between formats, the area and power may
increase. It is therefore not clear which number system should be used for each
arithmetic operator to minimise area and power. It may be possible to use dedicated
hardware resources for some operators on an FPGA, providing that the operator
uses a given number system. The dedicated resources are limited and must be shared
among every operation in the data-flow graph. The number of available resources
will, in many cases, affect the choice of data representation for every operator,
whether constructed with dedicated resources or not. This will happen because the
representation of one operator can affect the representation of all operators connected
to it. A change in the representation of one operator leads to a choice: should the
operators connected to it also change representation or maintain their representation
given that additional conversion logic may need to be added?
As well as the number system, the word-length of a variable — the number of
bits used to represent the data — must be calculated. It has been shown [19] that
bits are wasted in high-level applications. This means that part of the circuit does
not contribute to the result and should be switched off or removed. This problem
occurs in processors, FPGAs and ASICs. Each variable has an associated range and
precision (exponent and mantissa in floating-point systems). The range must be
large enough to represent the integer part of the variable to avoid an overflow. The
precision is calculated either by guaranteeing the accuracy of the software application
or by providing output error constraints and using an algorithm to search for the
optimal solution. Both approaches are investigated in this thesis.
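One common way to obtain a worst-case range is interval arithmetic. The sketch below is an illustration under that assumption, not necessarily the analysis used in this thesis: it propagates input intervals through an expression and then computes the number of signed integer bits needed to avoid overflow. The function names are hypothetical.

```python
import math

def interval_add(a, b):
    """Range of x + y for x in a and y in b."""
    return (a[0] + b[0], a[1] + b[1])

def interval_mul(a, b):
    """Range of x * y for x in a and y in b."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))

def integer_bits(lo: float, hi: float) -> int:
    """Smallest signed two's complement width that holds the integer part of [lo, hi]."""
    bits = 1
    while (math.floor(lo) < -(1 << (bits - 1))
           or math.floor(hi) > (1 << (bits - 1)) - 1):
        bits += 1
    return bits

x = (-3.0, 5.0)
y = (0.0, 10.0)
result = interval_add(interval_mul(x, y), x)   # range of x * y + x
print(result, integer_bits(*result))
```

Interval arithmetic treats each occurrence of a variable as independent, so the computed range is safe but may be wider than the true range.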
 3
To select the optimal representation, the accuracy is assumed to be the same as
the software application — the precision of floating-point variables is guaranteed
(chapter 3). This means that if fixed-point were to be used, the precision would
have to be at least 24 bits (unless a fixed-width constant were assigned to the
variable). Representing variables in fixed-point often requires additional bits because
floating-point has a greater sensitivity around zero. To optimise the word-length,
output precision constraints are used to determine the width of every variable using
worst-case error analysis (chapter 4). Statistical analysis is not used to estimate
error because it does not guarantee the output accuracy and may therefore require
extensive simulation. The goal is to automatically optimise the data representation
at compile time.
The data representation problem can be summarised as follows. A directed
graph is constructed representing the application. Each node in the graph represents
an arithmetic operation, for example, multiplication. Edges connect the nodes
in the data-flow graph, and every node must have a representation, for example,
floating-point. For a set of representation choices, rep1, rep2, rep3, ... and resources,
r1, r2, r3, ..., a node can be constructed from any valid combination. The data
representation of all of the nodes in the graph must remain consistent. This means
that if there is an edge from node x to node y in the graph, denoted edge(x, y), the
output of node x must be converted to the representation of node y if the two
representations are not equal. Conversion may have an area overhead. To complicate
the problem, a device contains a limited number of dedicated resources. Realising
a node in hardware requires a set of dedicated resources and a calculated amount
of area, for example, lookup tables, registers and multiplexers on an FPGA. Each
operator has a given size; in the case of a multiplier, both inputs and the output have
a given number of bits. The cost of an operator (its area, clock frequency, power
consumption, etc.) depends on its function, the widths
of a subset of its inputs and outputs (the subset being dependent on the function),
the type of resource it is constructed from and the architecture of the device used.
The questions addressed in this thesis are:
1. What is the optimal representation of each operator in the data-flow graph
with regard to a performance metric, for example, area, and for a given number of
dedicated resources? Can the cost be reduced if multiple number systems are
used in the same circuit (chapter 3)?
2. What is the optimal size of all of the input and output operands of an operator
with regard to a performance metric? Can an algorithm that runs rapidly,
regardless of the size of the problem, produce near-optimal solutions (chapter 4)?
Finding optimal solutions to these problems may not be possible given that large
software applications contain thousands of variables, so suboptimal solutions are
found. The trade-off between algorithm run time and how close the solution is to
being optimal is investigated.
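The data representation problem summarised above can be sketched as an exhaustive search over a very small data-flow graph. This is an illustration only: the area figures and the conversion cost are hypothetical numbers, word-lengths and dedicated resource limits are omitted, and exhaustive search stands in for the integer linear programming and simulated annealing approaches used in chapter 3.

```python
from itertools import product

# Hypothetical area costs (in LUTs) for each operation and representation.
AREA = {
    ("mul", "fixed"): 600, ("mul", "float"): 400, ("mul", "log"): 50,
    ("add", "fixed"): 30,  ("add", "float"): 350, ("add", "log"): 500,
}
CONVERSION_AREA = 100  # hypothetical cost of converting along one edge

def total_area(nodes, edges, assignment):
    """Operator area plus conversion area on every edge whose ends differ."""
    area = sum(AREA[(operation, assignment[name])] for name, operation in nodes.items())
    area += sum(CONVERSION_AREA for x, y in edges if assignment[x] != assignment[y])
    return area

def best_assignment(nodes, edges, representations=("fixed", "float", "log")):
    """Try every combination of representations and keep the cheapest one."""
    names = list(nodes)
    best = None
    for choice in product(representations, repeat=len(names)):
        assignment = dict(zip(names, choice))
        area = total_area(nodes, edges, assignment)
        if best is None or area < best[0]:
            best = (area, assignment)
    return best

nodes = {"m1": "mul", "m2": "mul", "a1": "add"}   # a1 = m1 + m2
edges = [("m1", "a1"), ("m2", "a1")]
print(best_assignment(nodes, edges))
```

With these made-up costs the cheapest circuit mixes representations even after paying for two conversions. The search space grows exponentially with the number of nodes, which is why exhaustive search is impractical for applications with thousands of variables.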
It has been shown [118] that control-flow analysis can be employed to reduce
the area of parts of a hardware design that are used infrequently. If a single input
value were to change, the control-flow could change causing a change in the optimum
size and representation of each operator [69, 119]. In this thesis it is shown that
word-length optimisation can be used in a similar way to reduce energy (chapter 5).
Three problems must be solved:
1. Where to reduce energy: The parts of the design that can be deactivated
when not required must be identified based on the accuracy of each phase
(section 5.1).
2. When to reconfigure: The different phases of execution must be extracted,
each phase having a given accuracy configuration (section 5.2).
3. How to reconfigure: The reconfiguration strategy must be selected to reduce
energy based on the length of time each phase is active and the amount of
hardware to be deactivated. Some methods of reconfiguration may not be
possible on certain devices, for example, ASICs cannot have the circuit updated
in the same way that an FPGA can (section 5.3).
Calculating which variables can have their width reduced at compile time can reduce
the power consumption and therefore the energy required. Determining when and
how to reconfigure the chip — allowing the circuit to adapt at run time — can enable
further reduction. Although an ASIC can only be partially reconfigured, energy can
still be reduced by adding hardware to switch off unused bits of precision.
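The choice of reconfiguration strategy depends on how long each phase is active. The following sketch uses a deliberately simple energy model that is an assumption made here for illustration, not the model derived in chapter 5: multiplexing draws a constant power with no switching cost, while loading a new bitstream costs a fixed amount of energy and then draws a lower constant power. All parameter names and values are hypothetical.

```python
def break_even_time(reconfig_energy_j: float,
                    mux_power_w: float,
                    bitstream_power_w: float) -> float:
    """Phase length (seconds) above which loading a new bitstream uses less energy.

    Assumed model: E_mux(t) = mux_power * t and
                   E_bitstream(t) = reconfig_energy + bitstream_power * t.
    """
    saving_w = mux_power_w - bitstream_power_w
    if saving_w <= 0:
        return float("inf")  # reconfiguring never pays off under this model
    return reconfig_energy_j / saving_w

def cheaper_strategy(phase_length_s, reconfig_energy_j, mux_power_w, bitstream_power_w):
    """Name the strategy with the lower energy for a phase of the given length."""
    threshold = break_even_time(reconfig_energy_j, mux_power_w, bitstream_power_w)
    return "bitstream" if phase_length_s > threshold else "multiplexer"

print(break_even_time(0.25, 1.5, 1.0))
print(cheaper_strategy(2.0, 0.25, 1.5, 1.0))
print(cheaper_strategy(0.1, 0.25, 1.5, 1.0))
```

Under this model short phases favour multiplexing because the fixed reconfiguration energy cannot be recovered in time.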
1.1 Contributions  5
1.1 Contributions
This thesis makes three contributions:
Chapter 3: An approach to optimising the data-formats and sizes used in a high-level software
application to enable it to be efficiently realised in hardware (section 3.2). An integer
linear programming (ILP) formulation of the approach is created to generate optimal
solutions. This is compared with simulated annealing (section 3.3). Simulated
annealing generates solutions more quickly, although they are not guaranteed to be
optimal. These solutions are then extended by exploiting device characteristics to
produce circuits optimised under resource and bandwidth constraints.
Chapter 4: An algorithm to generate area-efficient hardware designs rapidly, based on
compile-time word-length optimisation techniques which guarantee accuracy. Aggressive
heuristics are applied to determine non-uniform word-lengths while meeting
constraints from an error function (section 4.1). Data gathered at run time enables
functions using an unknown algorithm to have their width determined; it is also shown
how this information can be used to reduce the energy requirement (section 4.2).
Chapter 5: Two methods, multiplexer-based reconfiguration and bitstream reconfiguration, are
combined with word-length optimisation to develop run-time reconfigurable circuits
(section 5.1). The conditions under which multiplexer-based reconfiguration is more
efficient than multiple bitstream reconfiguration are derived (section 5.2). The two
reconfiguration strategies are then compared and approaches to minimising the power
consumption while multiplexing components are investigated (section 5.3).
1.2 Thesis Structure
This thesis is structured as follows:
Chapter 2 provides background information and related work in this area.
Chapter 3 addresses the question of how data should be represented in hardware
circuits. It is shown that using a combination of data types can reduce circuit area
despite the overhead of representation conversion.
Chapter 4 shows an approach to compile-time accuracy-guaranteed word-length
optimisation which generates near-optimal solutions rapidly.
Chapter 5 extends the techniques outlined in chapter 4 by looking at the effect of
adapting the accuracy of the system at run time to cater for a changing environment.
Chapter 6 summarises the work with an analysis of the results presented in the
previous chapters.
1.3 Selected Publications
1.3.1 Journal Publications
W.G. Osborne, W. Luk, J.G.F. Coutinho, O. Mencer. Energy Reduction by
Systematic Run-Time Reconfigurable Hardware Deactivation. Transactions
on HiPEAC, volume 4, issue 4, 2009.
H. Fu, W. Osborne, R.G. Clapp, O. Mencer, W. Luk. Accelerating Seismic
Computations Using Customized Number Representations on
FPGAs. EURASIP Journal on Embedded Systems, volume 2009. Hindawi
Publishing Corporation 2009.
1.3.2 Conference Publications
W.G. Osborne, W. Luk, J.G.F. Coutinho, O. Mencer. Reconfigurable
Design with Clock Gating. In Proceedings of the IEEE International
Conference on Embedded Computer Systems: Architectures, Modeling and
Simulation, pages 187–194. IEEE, July 2008.
W.G. Osborne, J.G.F. Coutinho, W. Luk, O. Mencer. Power-Aware and
Branch-Aware Word-Length Optimization. In Proceedings of the IEEE
International Symposium on Field-Programmable Custom Computing Machines,
pages 129–138. IEEE Computer Society Press, April 2008.
W.G. Osborne, J.G.F. Coutinho, W. Luk, O. Mencer. Instrumented Multi-Stage
Word-Length Optimization. In Proceedings of the IEEE International
Conference on Field-Programmable Technology, pages 89–96, December
2007.
CHAPTER 2
Background
In this chapter, previous approaches to solving the problems outlined in the
introduction are discussed. These include:
1. Calculating the optimal representation, range and precision of each operator
and operand in a data-flow graph (section 2.2).
2. Locating part of the hardware design to adapt at run time to reduce energy
and increase performance (section 2.3).
3. Software transformation and analysis facilitating high-level hardware design
(section 2.4).
The chapter is concluded by showing the advantages of the proposed approach over
those discussed.
[Figure 2.1 diagram; labels: LUT, LUT inputs, carry logic, carry chain, user-controlled multiplexer, flip-flop ports D, Q, RST and CE]
Figure 2.1: A simplified diagram of an FPGA logic block containing a lookup table (LUT)
capable of realising any 4-input, 1-output logic function, a flip-flop to enable a pipeline to
be constructed and multiplexers to route data around the chip.
2.1 Field-Programmable Gate Arrays
When solving computationally demanding problems, for example, seismic imaging
(FK migration [50]), there are several options:
General purpose processors.
Digital signal processors.
Dedicated graphics hardware.
Application-specific integrated circuits (ASICs).
Field-programmable gate arrays (FPGAs).
Due to the functions used — square root, sine and cosine — in seismic imaging
applications, it may be inefficient to use digital signal processors or dedicated graphics
hardware. For this application, performance improvements of 5–7 times have been
shown using FPGAs over general purpose processors. This is in part due to the
flexibility of FPGAs, enabling the use of a variety of number systems in the same
hardware design. General purpose processors do not have this functionality because
they have a fixed architecture. ASICs can also be used to create an architecture
specific to a given problem; however, once created, an ASIC cannot be altered.
Field-programmable gate arrays are reconfigurable devices containing the following
components:
2.2 Arithmetic Analysis  9
Lookup tables (LUTs) capable of calculating any logical function with a given
number of inputs — typically between four and eight in current devices — and
a single output.
Flip-flops allowing a pipeline to be constructed, increasing throughput.
Multiplexers to select and route data around the chip.
Dedicated arithmetic to increase the performance of commonly used functions,
such as multiplication.
Memory units to store data efficiently.
Carry logic reducing the area required to construct an adder.
A large routing matrix to channel data between components in a specified
manner.
Configuration memory to store the current circuit layout.
A simplified architecture of the basic unit of an FPGA is shown in figure 2.1.
To improve efficiency and flexibility, additional logic gates and multiplexers are
sometimes included.
This flexibility comes at a cost [84]: a 40 times increase in area and a reduction in
clock frequency of 3.2 times compared with an ASIC. For this reason, FPGAs contain
dedicated hardware, commonly multipliers, to reduce this performance gap. For the
remainder of this thesis it is assumed that FPGAs are used; however, methods of
applying the techniques to ASICs and processors are also investigated.
2.2 Arithmetic Analysis
Every variable in a data-flow graph has a representation, range and precision. The
representation must be selected carefully to reduce the area and power consumption
of a hardware circuit while increasing its maximum clock frequency. Research
into selecting the most efficient number system is discussed in section 2.2.1. The
word-length of a variable — its combined range and precision — is discussed in
sections 2.2.2 and 2.2.3.
Number System     Add/Subtract   Multiply      Divide
Floating-point    medium         medium        high
Fixed-point       low            medium/high   high
Logarithmic       high           low           low
Table 2.1: Relative cost of constructing operators in different number systems on an FPGA
or ASIC.
Approaches to finding the optimal representation for each variable can be split
into two categories: compile-time and run-time. Compile-time approaches [87, 108]
tend to take less time, but they may overestimate the size of a function because the
worst case must be assumed. Approaches based on simulation [2, 83, 111] (application
profiling) do not usually guarantee the accuracy of results because they are dependent
on the input data. In some cases it may also be possible to add run-time safeguards
to the system; however, this is not done here because the safeguards are application-
and data-dependent. When optimising the word-length of a variable, several factors
must be taken into account, such as the error requirement, the availability of
test data and the number format (for example, floating-point).
2.2.1 Number Systems
The number system corresponds to the way variables are stored and manipulated
by arithmetic functions. Each format has advantages and disadvantages, discussed
in chapter 1. Several characteristics are used to determine suitability for a given
application.
Dynamic range — the range of values that can be accurately represented.
Relative cost of each operation.
Accuracy.
Floating-point values are stored with an exponent and mantissa,
\[ (-1)^{\mathrm{Sign}} \times 1.\mathrm{Mantissa} \times 2^{\mathrm{Exponent} - \mathrm{Bias}} \]
and thus have a large dynamic range. As the value increases, its precision drops
because the width is fixed. For this reason, fixed-point may be used which has a fixed
range (integer part) and precision (fractional part). The relative cost of constructing
different operators in each number system must also be considered. Table 2.1 shows
2.2 Arithmetic Analysis  11
that the choice depends on the type and quantity of the different operators in the
data-flow graph. Consider the structure of fixed-point and floating-point multipliers.
Two stages may be added.
Rounding. Floating-point functions usually support rounding (IEEE-754
standard) to increase the accuracy given the constrained number of bits. Fixed-point
functions may support rounding to reduce their size, given an error constraint.
Exponent addition.
Floating-point addition, however, requires more complex hardware. A barrel shifter
(hardware able to shift a register by a given number of bits in a single clock cycle,
if there is only one pipeline stage) aligns the binary points, which increases the
size of the unit (approximately $n \log(n)$ multiplexers for an $n$-bit variable). Number
systems have therefore been suggested to take advantage of the large dynamic range
of floating-point and the relatively low cost of fixed-point operators. Dual
fixed-point [44] uses an exponent to increase the dynamic range of fixed-point variables;
since the exponent is limited to 1 bit, addition is less complex because a
barrel shifter is not required.
Selecting the correct format relies on knowing the accuracy requirements. When
the absolute accuracy must be guaranteed, fixed-point is often selected. When a high
accuracy around zero is required, floating-point is often chosen because the accuracy
is maintained no matter how small the variable. Floating-point may suffer from a
loss of accuracy when values are added with different exponents because as one value
is shifted, the accuracy may be reduced. For this reason, designing floating-point
libraries is not trivial.
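The floating-point layout and the loss of accuracy during addition can be illustrated
with a minimal Python sketch. It assumes IEEE-754 single precision (8 exponent bits,
23 stored mantissa bits, bias 127); the helper names decode_float32, rebuild_float32
and to_float32 are introduced here purely for illustration.

```python
import struct

def decode_float32(value):
    """Split a value, stored as IEEE-754 single precision, into its bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased exponent (bias = 127)
    mantissa = bits & 0x7FFFFF       # 23 stored fraction bits
    return sign, exponent, mantissa

def rebuild_float32(sign, exponent, mantissa, bias=127):
    """Evaluate (-1)^Sign * 1.Mantissa * 2^(Exponent - Bias) for a normalised value."""
    return (-1.0) ** sign * (1.0 + mantissa / 2.0 ** 23) * 2.0 ** (exponent - bias)

def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

fields = decode_float32(-6.5)
print(fields, rebuild_float32(*fields))

# Adding operands with very different exponents: the smaller operand is
# shifted out of the 24-bit significand and the sum is unchanged.
big = to_float32(2.0 ** 25)
print(to_float32(big + 1.0) == big)
```

The final comparison shows the effect described above: once the exponents differ by
more than the mantissa width, the smaller operand no longer contributes to the result.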
In the following sections, the width reduction of operators and the effect this has
on the error produced are discussed.
2.2.2 Range Analysis
Range analysis corresponds to reducing the size of the non-fractional part of a
variable [100, 116, 117]. One of the most common approaches is to simulate the
design, either by modifying the source code, or operator overloading. This provides
very tight bounds on the range but could underestimate it if the test set were not
large enough. Test data are not always available so the range must be calculated at
compile time. The simplest method, interval arithmetic [100], involves propagating
the input range, or interval, through the data-flow graph to calculate an output
range. Interval arithmetic has the following rules:
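\[
\begin{aligned}
[a, b] + [c, d] &= [a + c,\ b + d] \\
[a, b] - [c, d] &= [a - d,\ b - c] \\
[a, b] \times [c, d] &= [\min(ac, ad, bc, bd),\ \max(ac, ad, bc, bd)] \\
\frac{[a, b]}{[c, d]} &= \left[ \min\left( \frac{a}{c}, \frac{a}{d}, \frac{b}{c}, \frac{b}{d} \right),\ \max\left( \frac{a}{c}, \frac{a}{d}, \frac{b}{c}, \frac{b}{d} \right) \right]
\end{aligned}
\]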
Problems have been found with this method if correlations exist [87]. Consider
$\bar{x} - \bar{x}$, where $\bar{x}$ represents the interval of $x$. This does not equal
zero if interval arithmetic is used but rather $[x_{\min} - x_{\max},\ x_{\max} - x_{\min}]$.
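The rules above can be sketched in a few lines of Python. The Interval class below is
an illustrative assumption, not the implementation used in the cited work; it also
rejects division by an interval containing zero, a case the rules do not cover.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __truediv__(self, other):
        # The division rule is only valid when the divisor excludes zero.
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains zero")
        quotients = [self.lo / other.lo, self.lo / other.hi,
                     self.hi / other.lo, self.hi / other.hi]
        return Interval(min(quotients), max(quotients))

x = Interval(1.0, 3.0)
print(x - x)  # Interval(lo=-2.0, hi=2.0): the correlation between operands is lost
```

Because both operands are treated as independent, the subtraction returns a range of
width four rather than zero, which is the overestimation described above.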
To improve the results obtained from interval arithmetic, affine arithmetic [117]
is introduced which takes into account correlation. Each variable in affine form has
correlation coefficients that may appear multiple times, represented as follows:
xˆ = x0 + x11 + x22 + ...+ xnn, where i = [−1, 1]
A range is converted to an affine form as follows:
\[ x_0 = \frac{1}{2}(x_{\max} + x_{\min}), \qquad x_1 = \frac{1}{2}(x_{\max} - x_{\min}) \]
Once the intervals $[x_{\min}, x_{\max}]$ are in the form of $\hat{x}$, they can be added,
multiplied and converted back to intervals when required. A disadvantage of this approach is
shown when multiplying expressions of the following form:
\[ (x\epsilon_i)(y\epsilon_i) \]
The conservative approximation [87] is:
\[ xy\,\epsilon_{i+1} \]
Other methods exist to calculate this, for example, Chebyshev approximation, but
these are significantly more computationally intensive (see [117] for a more detailed
overview). It is not always possible to calculate every range. Some correlation
information may therefore be lost.
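A minimal sketch of affine forms, restricted to addition and subtraction, shows how
shared noise symbols preserve correlation. The Affine class and its method names are
assumptions made for illustration; multiplication is deliberately omitted because it
requires the approximation discussed above.

```python
class Affine:
    """x0 + sum(x_i * eps_i), where every noise symbol eps_i lies in [-1, 1]."""
    _next_symbol = 0  # counter used to create fresh noise symbols

    def __init__(self, centre, terms=None):
        self.centre = centre
        self.terms = dict(terms or {})  # noise symbol index -> coefficient

    @classmethod
    def from_range(cls, lo, hi):
        """x0 = (max + min) / 2 and x1 = (max - min) / 2 on a fresh noise symbol."""
        symbol = cls._next_symbol
        cls._next_symbol += 1
        return cls(0.5 * (hi + lo), {symbol: 0.5 * (hi - lo)})

    def _combine(self, other, sign):
        terms = dict(self.terms)
        for symbol, coeff in other.terms.items():
            terms[symbol] = terms.get(symbol, 0.0) + sign * coeff
        return Affine(self.centre + sign * other.centre, terms)

    def __add__(self, other):
        return self._combine(other, 1.0)

    def __sub__(self, other):
        return self._combine(other, -1.0)

    def to_range(self):
        radius = sum(abs(coeff) for coeff in self.terms.values())
        return (self.centre - radius, self.centre + radius)

x = Affine.from_range(1.0, 3.0)
y = Affine.from_range(1.0, 3.0)   # same range, but an independent variable
print((x - x).to_range())         # (0.0, 0.0): the shared symbol cancels
print((x - y).to_range())         # (-2.0, 2.0): no correlation to exploit
```

Subtracting a variable from itself now yields exactly zero, whereas two independent
variables with the same range still produce the interval-arithmetic result.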
Fang et al. [45] make use of affine arithmetic to calculate the range and precision
of floating-point variables. The authors show that the error calculated using interval
arithmetic is over double the error calculated using affine arithmetic for an inverse
discrete cosine transform.
2.2 Arithmetic Analysis  13
2.2.3 Precision Analysis
Precision analysis corresponds to calculating the number of bits needed to store the
fractional part of a variable while maintaining a specified accuracy. As the number
of bits decreases, the size and power consumption of the hardware circuit decrease,
but so does the accuracy; in an image processing application this may
lead to poorer image quality [50]. Several factors must be considered in order to
achieve the best results:
Choice of number system. Fixed-point variables have a constant precision width,
regardless of range; however, the precision width of floating-point variables
decreases as the range increases, for a given word-length.
Complexity. Finding the optimal precision for each variable is NP-hard [28].
Error metric. Providing a termination condition can often prove challenging.
For example, it is not always clear whether the output is acceptable, as is the
case with seismic image processing [50]. In this case, the key is determining
whether patterns exist; the individual pixel values are less important.
Cost metric. The determination of suitable cost metrics is often difficult,
particularly if library functions are used which cannot be fully analysed.
The tolerance to error. There are three options: guaranteeing accuracy,
estimating accuracy based on statistical conditions or meeting the accuracy of
another design (such as a software application). Two of these conditions are
explored in this thesis (chapters 3 and 4). Meeting statistical conditions usually
requires extensive simulation and therefore cannot be done at compile time.
For this reason, this method is not explored in this thesis.
The level at which the analysis should occur: behavioural or structural, for
example, Handel-C or VHDL, or after resource mapping.
There are two approaches to precision analysis: compile-time and run-time. Cmar et
al. [27] identify the key disadvantages of each.
Simulation can be time-consuming, and solutions created in this way may still
result in the overflow of a variable, producing incorrect results.
Compile-time analysis of the source code can lead to circuits that are larger
than required.
Various approaches to range and precision analysis are discussed in the following
sections.
2.2.4 Bitwise
The Bitwise [116] project uses the SUIF [127] compiler framework to automatically
calculate the word-length of integers and pointers. This idea stems from the fact that
processor architectures are continually being extended to provide parallel instructions,
for example, MMX [106] and SSE, increasing the need to save area and power. A
common form of word-length analysis is used in which the range is propagated. This
approach results in a large area reduction; however, the authors note that it does
not allow the least significant bits to be removed. As an example, consider a binary
variable that is assigned the following values: 2, 4 and 6. In this case, the least
significant bit is always zero, so it can be removed. If the range were stored in the
form [2,6], this could not happen.
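The example can be checked with a short Python sketch. The function names are
hypothetical and the code assumes non-negative integers; it is not taken from the
Bitwise tool itself.

```python
def shared_trailing_zero_bits(values):
    """Count the low-order bits that are zero in every value."""
    combined = 0
    for value in values:
        combined |= value          # a bit is set if any value sets it
    if combined == 0:
        return 0                   # all values are zero; nothing to strip
    count = 0
    while (combined & 1) == 0:
        combined >>= 1
        count += 1
    return count

def bits_needed(values):
    """Word-length required once the shared trailing zero bits are removed."""
    shift = shared_trailing_zero_bits(values)
    return max(value >> shift for value in values).bit_length()

values = [2, 4, 6]
print(max(values).bit_length())   # 3 bits when only the range [2, 6] is known
print(bits_needed(values))        # 2 bits once the constant-zero LSB is removed
```

Tracking the set of values rather than only the range is what exposes the redundant
least significant bit.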
The tool is applied to fixed-point operators. Although no constraints are required,
these may be used to further restrict the range. It is a compile-time approach which
does not require any run-time information but instead uses a data-flow graph to
propagate data forward and backward. Backward propagation is applied because
variables may sometimes exceed known limits, for example, taking the square root of
a negative number. If this occurs, starting from this operator, the value is propagated
backward to reduce the range. Static single assignment (SSA) form is applied so
that each variable can have its range restricted. Control-flow information further
restricts the range. If a value were compared with zero, x < 0, the true branch
of the control-flow graph, if taken, would result in the range of x being negative.
It is emphasised that loops may pose a significant problem given that the number
of iterations, or the growth factor of variable ranges within the loop, cannot be
accurately calculated. A pointer analysis [109] detects pointer aliases; arrays are
then flattened into scalar variables in order to simplify the analysis. Although this
may provide a conservative range estimate, this is common practice when analysing
arrays because the widths of elements tend to be similar.
Using this tool results in a 12–80% area improvement for applications containing
scalars and up to 93% area improvement for applications containing arrays. As well
2.2 Arithmetic Analysis  15
as area, the latency along the critical path can be reduced; up to a 3 times increase
in clock frequency is obtained for constant-coefficient multipliers. Power savings of
71% for an ASIC design running at 200MHz are achieved.
2.2.5 Fixed-Point Design and Simulation Environment
Many hardware circuits employ fixed-point functions because they tend to be smaller
than their floating-point equivalent. The fixed-point design and simulation
environment [77] enables floating-point applications to be transformed into fixed-point and
simulated accurately [76]. The system takes as input an annotated floating-point
application developed in ANSI–C. To enable information about fixed-point variables
to be added to the ANSI–C description, the language has been modified to encompass
fixed-point data types, including range and precision information. The design flow is
as follows:
1. Annotations are added to the design where the format and word-length are
known, for example, integer variables with a known width. In this case, a
fixed-point format is adopted, although there is no reason why this approach
could not be extended to other formats.
2. All operations that have a floating-point representation are converted to fixed-
point based on the information gathered by simulating the application. On
FPGAs, fixed-point units tend to be smaller, although this is not always the
case (see chapter 3 for details).
An interpolation-based approach is applied to calculate any word-lengths that have
not been specified. Simulating the system determines whether the output meets the
accuracy requirements. The remaining floating-point variables are given a width by
interpolation using the following annotations: variable mean and variance, relative
error and maximum error [126]. This approach has two key benefits:
The design time is reduced because the approach is partially automated.
The design space can be explored with manual intervention by constraining
specified variable widths.
The authors [77] highlight the problem of a variable being assigned twice. Two
options have been proposed: the first is to take the maximum word-length for that
variable (which may be inefficient), the second method requires that the variable be
renamed. In this case, each time a variable is assigned, it may use a different format.
2.2.6 Automatic Differentiation
Simulation is a common approach to analysing accuracy in software applications [2]
because floating-point accuracy is dependent on the range a variable can take.
Automatic differentiation [75] — the differentiation of functions in a program —
is one way of using simulation to calculate precision. Given a function, f(x), it
is differentiated to produce f ′(x). The input word-length can be calculated if the
output word-length of the function is known, given the formula:
\[ \Delta y \approx \frac{dy}{dx} \times \Delta x \]
where ∆x represents the error on x. The following equation [2] relates the error to
the width of the mantissa.
\[ 2^{m_2} = \frac{1}{2^{-m_1} - \dfrac{\Delta X / 2}{2^{E_{\max}}}} \]
where m1 and m2 are the mantissa widths, ∆X is the error caused by reducing the
number of bits of the mantissa and Emax is the maximum value that the exponent
can take. The accuracy of variables can be propagated through the data-flow graph,
given the output accuracy, to calculate errors on internal variables.
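The first-order relation between input and output error can be sketched with
forward-mode automatic differentiation using dual numbers. This is an illustrative
assumption rather than the method of [2] or [75]: the Dual class, the test function f
and the helper names are all hypothetical, and only addition and multiplication are
supported.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = float(value)
        self.deriv = float(deriv)

    @staticmethod
    def _wrap(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._wrap(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._wrap(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__

def output_error(f, x, delta_x):
    """First-order estimate: delta_y ~= |dy/dx| * delta_x."""
    return abs(f(Dual(x, 1.0)).deriv) * delta_x

def allowed_input_error(f, x, delta_y):
    """Invert the estimate: the largest delta_x that keeps the output within delta_y."""
    return delta_y / abs(f(Dual(x, 1.0)).deriv)

def f(x):
    return x * x + 3 * x          # dy/dx = 2x + 3

print(output_error(f, 2.0, 2.0 ** -8))             # 7 * 2^-8
print(allowed_input_error(f, 2.0, 7 * 2.0 ** -8))  # 2^-8
```

Given a required output error, the second helper returns the input error that can be
tolerated at the chosen operating point, which in turn bounds the input word-length.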
The process of differentiating a program can be applied in two ways: source
code transformation and operator overloading [54]. Operator overloading is the less
efficient of the two since the compiler cannot always optimise the source code to the
same degree. Source code transformation is more difficult because it requires the
source code be parsed and modified without affecting the result; the work presented in
this thesis extends ROSE [107] (section 2.4) to tackle this problem. A C++ program
is modified such that each operator under analysis has been overloaded with user-
defined classes which store the fixed-point representation of each variable. A forward
pass is required to generate a data-flow graph representing the program; a backward
pass then minimises a cost function.
Area reductions of up to 65% are shown for a 1% increase in error on the output
for a fast Fourier transform and a 40% increase in clock frequency for a discrete
Fourier transform with 5% error on the output.
2.2.7 MATCH
Nayak et al. use the MATLAB compiler for heterogeneous adaptive computing
systems (MATCH) [60] to synthesise MATLAB designs to hardware circuits with
optimised data representation. MATLAB is chosen due to the availability of predefined
matrix, vector and signal processing libraries and the lack of some of the intricacies
of C/C++ and other high-level languages, for example, pointers.
Equation            Error Propagation
$a = b + c$         $a_{\mathrm{error}} = b_{\mathrm{error}} + c_{\mathrm{error}}$
$a = b \times c$    $a_{\mathrm{error}} \le (|b| \times c_{\mathrm{error}}) + (|c| \times b_{\mathrm{error}}) + 2^{-\mathrm{width}(a)}$
Table 2.2: MATCH error propagation functions [60].
The design flow can be split into the following stages:
MATLAB code is parsed and an abstract syntax tree (AST) produced.
Vectors and matrices are flattened to produce scalar variables with loops.
Consider the following example in which A and B are vectors and n is the size
of each vector.
A = B × 10;   →   for (i = 0; i < n; i = i + 1)
                      a[i] = b[i] × 10;
Expressions are split up into their constituent parts.
The range of input variables is propagated through the data-flow graph [102] to
estimate the range of internal variables. The errors on the inputs, created by
truncating variables to a given number of bits, are used to calculate the error on
output variables. Consider $y = a + b$. Given that $a$ and $b$ have an associated error,
the expression can be rewritten as $y = a + a_{\mathrm{error}} + b + b_{\mathrm{error}}$.
Table 2.2 shows the error propagation functions. It also shows that precision and range
are correlated; a narrower variable range on the inputs to a function may lead to a
reduced precision on the output. In this case a component of the error on the
multiplication, $b_{\mathrm{error}} \times c_{\mathrm{error}}$,
is omitted because its contribution to the overall error is considered small. Similar
error models are adopted in [24].
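The propagation functions of Table 2.2 can be written directly as a short Python
sketch. The function names are hypothetical, and width(a) is assumed here to be the
number of fractional bits kept on the result, which the table does not state explicitly.

```python
def add_error(b_error, c_error):
    """Error on a = b + c: the input errors accumulate."""
    return b_error + c_error

def mul_error_bound(b_max, c_max, b_error, c_error, width_a):
    """Bound on the error of a = b * c.

    b_max and c_max are the largest magnitudes of the operands (from range
    analysis); the second-order term b_error * c_error is omitted, as in the text.
    """
    return abs(b_max) * c_error + abs(c_max) * b_error + 2.0 ** -width_a

# Both inputs truncated to 8 fractional bits, with |b| <= 4 and |c| <= 2.
truncation = 2.0 ** -8
print(add_error(truncation, truncation))                   # 2 * 2^-8
print(mul_error_bound(4.0, 2.0, truncation, truncation, 8))  # 7 * 2^-8
```

The multiplication bound depends on the operand magnitudes, which is why a narrower
input range reduces the error on the output for the same input precision.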
Précis [23] is an extension of the MATCH compiler that gives the developer hints
about where to modify the precision, ultimately increasing performance.
2.2.8 Error Heuristic
Roy and Banerjee [108] have extended the MATCH [60] compiler with an algorithm
based on reducing word-lengths to optimise MATLAB designs. As with the MATCH
compiler, the range of each variable is calculated by propagating ranges through the
data-flow graph. Coarse-grain and fine-grain optimisation are combined to reduce
the precision of variables, followed by simulation to determine the error produced.
Word-length optimisation is NP-hard [28]. For this reason, heuristics are often
used in order to calculate a solution. The first stage is a coarse-grain analysis using
a binary search to find a uniform word-length, greatly reducing the search space
in most cases. It is unlikely that the width of inputs to an operator will vary by
a large amount, although this can occur if the range is much larger on one input;
some examples of this are shown in chapter 4. Following this, fine-grain optimisation
selects a word-length to decrease based on the error introduced as a result. Given a
set of word-lengths of length $n$:
\[ p_0, p_1, p_2, \ldots, p_{n-1} \]
\[ \forall i:\quad \mathit{reduced}_i =
\begin{cases}
p_0, p_1, \ldots, p_i - 1, \ldots, p_{n-1} & \text{if } p_i > 0 \\
p_0, p_1, \ldots, p_i, \ldots, p_{n-1} & \text{otherwise}
\end{cases} \]
The set, reducedi, resulting in the lowest error is chosen and the process repeated.
When no further reduction can be taken without breaking the error requirement, the
process terminates. The second stage increases selected word-lengths with the aim
of allowing other word-lengths to be reduced. This is done in order to explore the
search space more fully, improving the final result.
\[ \mathit{increased} = p_0, p_1, \ldots, p_i + 1, \ldots, p_j - 2, \ldots, p_{n-1} \]
The element that results in the greatest decrease in error ($p_i$) is increased and
the element that results in the smallest increase in error ($p_j$) is decreased. The
word-lengths are then reduced again.
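The fine-grain reduction stage can be sketched as a greedy loop in Python. This is a
minimal illustration under stated assumptions: the error model toy_error (a weighted
sum of truncation errors) and the function names are hypothetical stand-ins for the
simulation used by the authors, and the second stage that increases selected
word-lengths is not shown.

```python
def greedy_reduce(word_lengths, error_of, max_error):
    """Repeatedly apply the single-bit reduction that gives the lowest error."""
    current = list(word_lengths)
    while True:
        best = None
        for i, p in enumerate(current):
            if p <= 0:
                continue                      # reduced_i is unchanged when p_i = 0
            candidate = current[:i] + [p - 1] + current[i + 1:]
            err = error_of(candidate)
            if err <= max_error and (best is None or err < best[0]):
                best = (err, candidate)
        if best is None:
            return current                    # no reduction meets the requirement
        current = best[1]

def toy_error(word_lengths, weights=(1.0, 4.0)):
    """Hypothetical error model: each signal contributes weight * 2^-bits."""
    return sum(w * 2.0 ** -p for w, p in zip(weights, word_lengths))

result = greedy_reduce([8, 8], toy_error, max_error=2.0 ** -4)
print(result, toy_error(result))
```

With this toy model the less sensitive signal is reduced further than the more
sensitive one, which is the behaviour the heuristic is intended to produce.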
2.2.9 Cost Heuristic
As stated in section 2.2.1 there are several different number systems, designed for
different tasks. Fang et al. [46] present a methodology for reducing the word-length
of floating-point variables in C++ programs. Operator overloading allows the design
to be simulated accurately; this approach has also been used in [78] with a uniform
precision width. Signals are grouped together based on their word-length in order
to keep the number of different floating-point formats to a minimum. In general,
the greater the number of groups, the greater the difference will be between the
2.2 Arithmetic Analysis  19
variable widths and the greater the cost reduction. A heuristic algorithm reduces
the word-length using the mean square error. The use of optimised floating-point
units results in a 64.9% reduction in power consumption and a 66.8% reduction in
area; the use of fixed-point units results in a 31.9% reduction in power consumption
and a 44.8% reduction in area.
A simple cost model takes into account the width and type of each operator to
reduce power consumption:
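\[ \mathit{cost} = \sum_{op \in \mathit{operators}} \mathit{power}(\mathit{width}_{op}, \mathit{type}_{op}) \times \mathit{frequency}_{op} \]
where $\mathit{width}_{op}$, $\mathit{type}_{op}$ and $\mathit{frequency}_{op}$ are the
characteristics of arithmetic operator
$op$. The frequency is required if power is being minimised but not area. The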
architecture of the device is factored into the variable grouping because each embedded
functional unit can only support a limited number of types. An aggressive cost
heuristic allows a near-optimal solution to be generated. The first stage is to construct
the following set.
\[ \forall i\ \{ \mathit{reduced}_i = p_0, p_1, \ldots, p_i - j, \ldots, p_{n-1} \} \]
where j cannot be increased further without breaking the performance requirement.
The width of every variable is then reduced to the smallest value in this set; this
is similar to the uniform precision analysis in [87] with the difference that this is a
lower bound, and it is likely to break the error requirement. The number of bits
of the mantissa of each variable is then increased independently. The solution that
maximises the following equation is chosen.
\[ \frac{\Delta \mathit{performance}}{\Delta \mathit{cost}} \]
If $\Delta \mathit{performance}$ is never greater than zero, the number of bits of every
variable is increased by 1 and the step repeated until the performance criteria are met.
The final step is to reduce each variable width independently.
\[ \forall i:\quad \mathit{reduced}_i =
\begin{cases}
p_0, p_1, \ldots, p_i - 1, \ldots, p_{n-1} & \text{if } p_i > 0 \\
p_0, p_1, \ldots, p_i, \ldots, p_{n-1} & \text{otherwise}
\end{cases} \]
The reduction that causes the greatest decrease in cost is chosen; the step is repeated
until no further reduction can occur without breaking the performance requirement.
2.2.10 Guaranteeing Accuracy
Modifying the width of arithmetic operators requires a specified degree of accuracy
to be maintained. Lee et al. have developed an accuracy-guaranteed approach
to word-length optimisation [87] which takes a modified form of C++, ASC [97], as input.
Operator overloading detects parts of the source code to be analysed. All loops are
unrolled automatically as the code is parsed.
The range of each variable is propagated using affine arithmetic [117]. This
technique represents a range as a polynomial containing coefficients related to each
variable. This has the advantage of including correlations between variables, reducing
the range obtained with interval arithmetic [100]. This is important because range
can affect precision [102]. The approach gives a range within 7% of the range obtained
from simulating the design. To analyse precision, a compile-time error propagation is
adopted to produce accuracy-guaranteed solutions. This is similar to the technique
described in section 2.2.7 but differs because it makes use of affine arithmetic to
improve results if correlations exist. A variable is represented as follows:
\[ \mathit{error}_a = 2^{-t}\,\epsilon_a \]
where $\epsilon_a$ is a correlation coefficient (section 2.2.2) and $t$ depends on the
number of bits to truncate and the rounding mode.
Integer linear programming (ILP) is a method of finding the optimal solution
to a linear function, given a set of constraints. This method is employed, based
on [32], to determine the optimal set of word-lengths. The authors report that the
ILP formulation can take several hours to produce results. Although ILP produces
optimal results with regard to a cost model, an inaccurate cost model causes the
results obtained from the place and route tools to be suboptimal. This can often
lead to architecture-specific cost models.
Simulated annealing is used to optimise the precision of operators based on worst-
case error models — underflow and overflow are therefore avoided. An error function
is employed to guarantee that the error on the outputs is below a given threshold
and a cost function ensures that the area of the resulting circuit is minimised. The
authors show up to 26% improvement for small hardware kernels. The algorithm
execution time ranges from 1.9 seconds for a small polynomial approximation to 179
seconds for an 8×8 discrete cosine transformation. This approach may not prove
effective if entire applications are analysed as opposed to small kernels.
2.2 Arithmetic Analysis  21
2.2.11 Architecture Mapping
An important question regarding word-length analysis is: when should it take place?
Should it, for example, take place before or after the arithmetic operators have been
mapped to architectural resources? Constantinides et al. [31] highlight the difference
between adders and multipliers with regard to mapping operations to a resource;
performing two additions on a single adder is a simpler operation than performing
two multiplications on a single dedicated multiplier, for example. The authors then
formulate the problem as an assignment of colours to a graph, subject to a set of
constraints. It is combined with word-length optimisation using a heuristic [33].
Kum et al. [83] simulate designs to reduce the range and precision of variables. A
system is proposed in which the scheduling of operations to resources is performed
before the word-length optimisation. Signals are grouped together to reduce the
complexity of the problem, as in [46] — signals connected to the inputs and outputs
of delays and adders are grouped since they have similar word-lengths. Signals
connected to the inputs and outputs of multipliers are not grouped. Two approaches
are investigated: integer linear programming (ILP) and a heuristic based on list
scheduling which assigns operations from the largest word-length to the smallest.
To reduce the number of resources required, constant multipliers are converted into
adders and shift registers. The precision width of each variable is then reduced, one
bit at a time, until no further reduction can occur without breaking constraints. The
analysis order:
1. Group signals.
2. Optimise word-lengths.
3. Schedule operations.
is modified such that the word-length optimisation is performed after the scheduling
of operations.
This means that an accurate, architectural cost model can be used. The authors
show that a combined word-length optimisation and high-level synthesis can produce
results up to 15% better than high-level synthesis performed after word-length
analysis. The disadvantage of this approach is that a new cost model may be
required for each device.
2.2.12 Right–Size
The number of mobile devices is increasing rapidly. As the size and complexity of
such devices increases, reducing the power consumption becomes more important.
Constantinides [30] shows that word-length optimisation can reduce power
consumption by looking at the sensitivity of variables to small errors. This approach employs
a tool based on Right–Size [29] to achieve a reduction in power of up to 98%, an
area reduction of up to 80% and an increase in clock frequency of up to 36% for least
mean square adaptive filters.
Right–Size, a methodology based on [34], uses a binary search to calculate a
uniform word-length. This word-length is then increased to search a larger portion
of the search space (given that after the initial increase, each word-length can only
be reduced). Each word-length is reduced until it cannot be reduced further without
breaking the error requirement. The word-length ultimately selected to be reduced
in the final solution is the one that provides the greatest cost improvement. This is
illustrated below.
\[ \forall i\ \{ \mathit{reduced}_i = p_0, p_1, \ldots, p_i - j, \ldots, p_{n-1} \} \]
where $j$ cannot be increased without breaking the error requirement. This process is
then repeated. The error function is therefore called multiple times to analyse the
effect of reducing each word-length. The error model introduces noise, with a given
variance, for every truncation, which is then propagated from input to output.
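The initial uniform word-length search can be sketched as a binary search in Python.
The sketch assumes that the error never increases when bits are added; the error
model toy_error, the search limits and the function names are hypothetical and are
not taken from Right–Size itself.

```python
def smallest_uniform_word_length(n_signals, error_of, max_error, lo=1, hi=64):
    """Binary search for the smallest uniform word-length meeting max_error.

    Assumes error_of is non-increasing in the word-length, so that every
    width above the returned value also meets the requirement.
    """
    if error_of([hi] * n_signals) > max_error:
        raise ValueError("error requirement cannot be met within the search range")
    while lo < hi:
        mid = (lo + hi) // 2
        if error_of([mid] * n_signals) <= max_error:
            hi = mid          # mid is feasible; try something smaller
        else:
            lo = mid + 1      # mid is too small
    return lo

def toy_error(word_lengths, weights=(1.0, 4.0)):
    """Hypothetical error model: each signal contributes weight * 2^-bits."""
    return sum(w * 2.0 ** -p for w, p in zip(weights, word_lengths))

uniform = smallest_uniform_word_length(2, toy_error, max_error=2.0 ** -4)
print(uniform)
```

The value returned is only a starting point; the individual word-lengths are then
adjusted by the reduction step described above.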
Generic power models have been proposed to reduce power further. Consider the
model for an adder [95]:
\[ P_{\mathit{adder}} = A_{\mathit{adder}} \times V^2 \times f \]
where A is the area, V is the input voltage and f is the switching activity; a
similar model is adopted for multipliers. These models are designed to operate on
a high-level hardware description and are therefore flexible and fast (discussed in
more detail in section 2.5). The drawback is that they may not be as accurate as
architecture-specific models.
2.2.13 Application to Processors
Power can be characterised as static or dynamic. Static power, also known as leakage
power, is the power dissipated when the circuit is not changing state (switching).
Dynamic power is the component related to switching. Cao and Yasuura [22] reduce
2.3 Phase Analysis  23
operator widths in order to minimise leakage power, showing that word-length
analysis can be applied to processor design. The goal is to reduce the width of the
data-path of a soft-core processor using word-length analysis techniques. Valen-C [68]
(Variable Length C), a compiler developed with SUIF [127], is adopted because it
targets a variant of C which supports more flexible numeric formats. The precision
assigned to a variable may be increased to match the width of the bus but not
decreased. For this reason, the problem has a lower complexity because a uniform
word-length is sufficient.
SPICE (Simulation Program with Integrated Circuit Emphasis) simulations are
used on several different memory units to derive equations that estimate the power
of RAM and ROM with regard to the characteristics of the data stored. Power
reductions of up to 66% and 59% have been achieved for static and dynamic power
consumption respectively.
2.3 Phase Analysis
A single word-length analysis that caters for every phase of execution may
overestimate the required size of a function at a given time if the scenario changes. This
section discusses research carried out to determine when a scenario has changed and
whether this should result in a modification of the circuit or application.
2.3.1 Word-Length Adaptation
Bondalapati and Prasanna [17] show that reconfiguring the circuit at run time
reduces the time taken by up to 37%. Because the design is reconfigured at run time,
the approach may not be applicable to application-specific integrated circuits (ASICs).
The first stage of the approach is to create a graph of precision against iteration
number showing how accuracy requirements change over time. Loops are targeted
because they are the most computationally intensive part of an application. The
variables are assigned a precision based on the type of operation, for every iteration
of the loop. If the loop is too complicated to analyse statically, it is profiled, leading
to less conservative results.
Given the overhead of reconfiguration compared to execution time, it may not
always be possible to reconfigure the circuit every time a precision changes. For this
reason, the optimal reconfiguration schedule is found. Five approaches are compared:
A uniform word-length is chosen for every iteration.
As above, but the number of loop iterations is known; the uniform word-length
can therefore be reduced.
The configuration resulting in the lowest execution time is chosen, but
reconfiguration time is not considered.
A configuration with a higher execution time between reconfigurations is accepted
in exchange for fewer reconfigurations.
A controller is added to analyse the precision at run time.
The authors show that the strategies lead to reduced run time, with the first method
being the slowest (675 ms) and the fifth method being the fastest (425 ms). This
approach does not consider operators that produce results with an infinite precision
or other methods of reconfiguration.
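The trade-off between execution time and reconfiguration overhead can be illustrated with a small dynamic program. This is a minimal sketch and not the algorithm of [17]: it assumes that the time for one iteration is proportional to the configured word-length, that every reconfiguration has the same fixed cost, and that a configuration can execute any iteration whose required precision does not exceed its word-length. The function and parameter names are hypothetical.

```python
def best_schedule(precisions, reconfig_cost, time_per_bit=1.0):
    """Split the loop iterations into segments, each run with one word-length.

    precisions[i] is the word-length required by iteration i (assumed known).
    Returns (total_time, segments), where each segment is (start, end, width).
    """
    n = len(precisions)
    best = [0.0] + [float("inf")] * n   # best[j]: minimum time for iterations 0..j-1
    cut = [0] * (n + 1)                 # cut[j]: start of the last segment ending at j
    for j in range(1, n + 1):
        width = 0
        for i in range(j - 1, -1, -1):
            # The segment i..j-1 must use the widest precision it contains.
            width = max(width, precisions[i])
            cost = best[i] + (j - i) * width * time_per_bit
            if i > 0:
                cost += reconfig_cost   # switching configuration before this segment
            if cost < best[j]:
                best[j], cut[j] = cost, i
    segments = []
    j = n
    while j > 0:
        i = cut[j]
        segments.append((i, j, max(precisions[i:j])))
        j = i
    segments.reverse()
    return best[n], segments


required = [8, 8, 8, 8, 16, 16, 16, 16]
cheap_total, cheap_plan = best_schedule(required, reconfig_cost=10)
costly_total, costly_plan = best_schedule(required, reconfig_cost=100)
```

With a cheap reconfiguration the schedule switches word-length once; with an expensive one it falls back to a uniform word-length, which corresponds to the first strategy in the list.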
2.3.2 Phase Characterisation
Phase analysis has previously been used in operating systems [39] to allow a program
to adapt to changing conditions. Working set signatures — a summary of the working
set that is significantly smaller — have been proposed [40] to characterise the different
phases of execution of a processor, such that software and hardware parameters can
be tuned to increase performance.
Isci et al. [69] analyse power phases for the SPEC 2000 benchmark suite. The
study is based on performance counters, such as the number of cache misses, and
control-flow information, such as the number of times a block of code is executed;
power is not estimated in this case. Intel Pin [92] gathers performance metrics
by adding code to a binary. Power phases are tracked by collecting cache access
rates and instruction counts: 15 performance counters in total. The accuracy of the
clustering is measured by comparing a sample power measurement within the phase
with the other values in the phase. Two approaches are tested:
• Blocks that execute a large number of times are likely to consume more power.
  Since power consumption can change based on data characteristics, additional
  metrics are required.
• Performance counters are used to estimate the power phases, for example,
  cache misses.
Once a power measurement has been estimated, the sample must be added to a
cluster. Two methods are applied:
• First pivot clustering. The first sample is assigned as a pivot. Subsequent
  samples are compared against each pivot. The sample is added to the cluster
  if the difference between sample and pivot is small enough; otherwise, a new
  cluster is formed with the sample as the pivot. It is not possible to determine
  the number of phases in advance. For this reason, agglomerative clustering is
  also considered.
• Agglomerative clustering. Starting with a set of clusters containing a single
  sample, each pair is compared to find the best candidates to combine into a
  single cluster.
Errors of 1.9%–7.1% were found for the counter-based approach compared with
2.9%–11.7% for the block-based approach.
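First pivot clustering as described above can be sketched in a few lines. This is an illustration and not the implementation of [69]: it assumes scalar power samples, an absolute-difference threshold, and that a sample joins the first pivot that is close enough. All names are hypothetical.

```python
def first_pivot_clustering(samples, threshold):
    """Assign each sample to the first pivot within `threshold`, else start a new cluster."""
    pivots = []
    clusters = []
    for sample in samples:
        for index, pivot in enumerate(pivots):
            if abs(sample - pivot) <= threshold:
                clusters[index].append(sample)
                break
        else:
            # No existing pivot is close enough: the sample becomes a new pivot.
            pivots.append(sample)
            clusters.append([sample])
    return pivots, clusters


pivots, clusters = first_pivot_clustering([1.0, 1.1, 5.0, 0.9, 5.2, 9.0], threshold=0.5)
```

The number of clusters is only known once all samples have been processed, which is the limitation noted in the text.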
Styles and Luk [119] characterise phases as a set of branch probabilities (the
phase signature). Throughput improvements of up to 95.4% are obtained. There are
two components to the approach:
• A method of generating configurations, one for each phase.
• A system to change phase.
A set of counters and a queuing model determine the number of times a specific
branch is executed and therefore characterise the phase of execution. If a hardware
block is not used frequently, its area can be reduced [118]. The idea is that each
basic block — a section of source code containing no conditional branches or loops
— can produce results at a different rate. Given that each block may finish at a
different time, a token¹ synchronises the computation and a FIFO is placed between
each block to buffer data. Branch frequencies are collected with an offline run-time
analysis. A queuing model assuming steady state branch probabilities is employed to
calculate arrival rates at each internal node. This allows blocks that can have their
throughput reduced while still avoiding stalls in the pipeline to be identified.
¹ A token is an item of data that has no purpose other than to synchronise different components
of the circuit.
The approach results in area reductions of up to 27.5% on a Xilinx Virtex FPGA
for a given performance and a throughput increase of 3.2 times for a given area. The
queuing model is shown to have a relative error of less than 0.12 for video feature
extraction and progressive refinement radiosity (a lighting technique used in 3D
graphics applications).
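The way steady-state branch probabilities determine the arrival rate at each block can be shown with a simple propagation over an acyclic graph of basic blocks. This is a simplified sketch and not the queuing model of [119]: it ignores queuing delay and loops, and it assumes the blocks are supplied in topological order. The graph and the names are hypothetical.

```python
def arrival_rates(edges, source, input_rate, order):
    """Propagate an input rate through a branch graph.

    edges maps a block to a list of (successor, branch_probability) pairs;
    order lists the blocks in topological order (assumed to be given).
    """
    rates = {block: 0.0 for block in order}
    rates[source] = input_rate
    for block in order:
        for successor, probability in edges.get(block, []):
            rates[successor] += rates[block] * probability
    return rates


edges = {
    "entry": [("then", 0.25), ("else", 0.75)],
    "then": [("exit", 1.0)],
    "else": [("exit", 1.0)],
}
rates = arrival_rates(edges, "entry", 100.0, ["entry", "then", "else", "exit"])
```

In this example the "then" block receives a quarter of the tokens, so it is a candidate for a lower-throughput, smaller implementation.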
2.4 High-Level Hardware Design
In this section, design tools enabling source code transformation and hardware design
are discussed.
2.4.1 Source Code Analysis and Transformation
The Stanford University Intermediate Form (SUIF) [127] is a compiler framework
capable of analysing and transforming Fortran, C and C++ source code. SUIF
consists of two core components:
• A system to represent the different operators and annotations, such as for
  loops and while loops, in an abstract way. This enables a variety of languages
  to be analysed in the same way.
• A method of transforming the representation.
Annotations in the form of #pragma statements help automate transformation and
enable different optimisations to be combined. The intermediate form uses both
high-level constructs and low-level operations. There are two reasons for this:
• Working directly with the source code constructs of a language can make the
  representation language-specific. The advantage of this approach is that all of
  the information is maintained.
• Working with low-level operations, for example, jump instructions, can make
  the representation platform-specific.
This lower-level representation can make abstractions in the source code difficult to
recognise; however, in some cases this can improve performance by up to 47% [11].
SUIF has been extended for a number of different research projects [68, 82, 104, 116] in
both hardware and software design. Gokhale et al. [51] use SUIF for stream-oriented
hardware design based on the CSP [65] (Communicating Sequential Processes) model
of computation. The language extensions include a mixture of library calls and
#pragma annotations.
One of the key problems with SUIF is that it is difficult to optimise high-level
abstract source code because it is converted to a platform-specific representation.
ROSE [107] allows direct source-to-source transformation of C, C++ and Fortran.
This means that domain-specific abstractions can be preserved or transformed such
that they can be optimised by a generic compiler. The elimination of domain-specific
compilers, which are often not widely adopted because of their limited applicability,
can make development less expensive. Generic compilers have problems as well.
Domain-specific languages are often more efficient because abstractions can be
more fully optimised. Annotation languages have been proposed to circumvent the
problem [59].
Transformation has also been employed to boost performance, such as automatic
parallelism [91]. As well as being a useful tool for software development because of the
introduction of multi-core processors, it is also beneficial for hardware development.
Coupled with this, the ability to parse object-oriented software enables ROSE to
convert user defined types to built-in types, enabling language extensions. It can
therefore be used to develop hardware descriptions which have additional arithmetic
types and formats.
2.4.2 Arithmetic Design
Designing arithmetic units in hardware is a challenging task given that the inputs
to an operator may use multiple number systems, for example, floating-point or
fixed-point. Coupled with this, selecting the word-lengths of each operand is a
computationally intensive task [28]. Due to the complexity of producing efficient
arithmetic units, approaches have been proposed to simplify this task. Tsoi [121]
has produced CAST, Computer Arithmetic Synthesis Technology, which provides a
consistent high-level interface to create circuits. It supports several different number
systems such as fixed-point, floating-point and logarithmic which are efficient in
different situations. CAST has been applied to the N–body problem [120] using
statements to construct and connect objects, shown in figure 2.2.
PAM-Blox [98] enables circuits to be created in a similar way by connecting
C++ objects together, combining the flexibility of a high-level language with the
    add_object = new Add_n("add1", n);
    connect(mul_object->P, add_object->A);
Figure 2.2: The structure of a circuit modelled as objects in a high-level language
(Computer Arithmetic Synthesis Technology).
efficiency of low-level, customised arithmetic units. JHDL is a platform-independent
methodology using Java objects [67]. Each of these systems provides a uniform
method of constructing different functional units and connecting them together.
Functional languages have been used extensively to provide abstract methods of
designing hardware [3, 13, 56].
2.5 Modelling Power
Designing power-efficient circuits using a high-level language is a complex task [130]
because the circuit must be mapped to a specific architecture, possibly changing the
amount of power it consumes. Farrahi et al. [47] show that mapping while minimising
power consumption is NP-complete.
Modelling power consumption is important because the device may not always
be available to take measurements. Power consumption in FPGAs and ASICs can be
characterised as static or dynamic. Static power consumption is not related to the
level of computation on a device. It can be modelled with the following formula [20]:
\[ P_{\mathrm{static}} = V \times \mathit{Tech} \times n \times D \]
where V is the supply voltage, Tech is a technology dependent parameter, n is the
number of transistors and D is a design dependent parameter. Dynamic power
consumption is the component of power related to signal transitions (switching) and
can be modelled as [80]:
\[ P = \sum_{r \in \mathit{resources}} C_r V_r^2 f_r \]
where $C_r$, $V_r$ and $f_r$ are the capacitance, voltage and operating frequency of resource
$r$, respectively. The clock frequency and the input data are important factors affecting
power consumption because they both affect the signal transition rate.
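The two formulas above translate directly into code. The sketch below is only an illustration; the function names and the numeric values are hypothetical and are not taken from [20] or [80].

```python
def static_power(voltage, tech, transistors, design):
    """P_static = V * Tech * n * D, as in the static power formula above."""
    return voltage * tech * transistors * design


def dynamic_power(resources):
    """Sum C_r * V_r^2 * f_r over an iterable of (C_r, V_r, f_r) tuples."""
    return sum(c * v ** 2 * f for c, v, f in resources)


# Two hypothetical resources: (capacitance in farads, voltage in volts, frequency in hertz).
p_dyn = dynamic_power([(1e-12, 1.0, 100e6), (2e-12, 1.0, 50e6)])
p_stat = static_power(voltage=1.2, tech=2.0, transistors=10, design=0.5)
```

Halving the switching frequency of a resource halves its contribution to the dynamic term, whereas the static term is unaffected by activity.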
One of the most high-level methods of power estimation is to use statistics about
each variable [15]. The statistics — the mean, variance and correlation — are used
to calculate bit characteristics: the bit probability, bit transition rate and temporal
correlation. The transition activity — the sum of the activity rate on each bit —
is shown to have a strong correlation with the power dissipated. The use of a 3D
lookup table has also been proposed [57] based on input signal probability, input
transition density and output signal transition density. The model is extended [58]
to take into account the spatial correlation between bits of the input, $SC_{ij}$:
\[ SC_{ij} = P\{x_i \wedge x_j = 1\} \]
This model is extended [12] again by using a temporal correlation coefficient and
omitting the output signal transition density because it requires time-consuming
simulations. Coupled with this, the spatial correlation is updated to:
\[ SC_{ij} = P\{x_i \ \mathrm{xnor}\ x_j = 1\} \]
xnor is chosen because it takes into account variables that have matching bits. The
model is then modified based on the word-length [72]. Four characteristics of the
input signal are therefore analysed: word-length, input signal probability, input
signal transition density and a spatial correlation coefficient. It is found that the
signal probability has little effect on the power consumption because most modern
devices are sensitive to signal transitions.
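The difference between the two spatial correlation definitions can be seen by estimating both from a set of input words. This sketch assumes that $x_i$ and $x_j$ denote bits i and j of the same input word and that the probability is estimated as a frequency over the samples; the function names are hypothetical.

```python
def bit(value, index):
    """Return bit `index` of a non-negative integer."""
    return (value >> index) & 1


def spatial_correlation_and(samples, i, j):
    """Estimate P{x_i AND x_j = 1}: both bits are one."""
    return sum(bit(v, i) & bit(v, j) for v in samples) / len(samples)


def spatial_correlation_xnor(samples, i, j):
    """Estimate P{x_i XNOR x_j = 1}: the two bits match, whether both zero or both one."""
    return sum(1 - (bit(v, i) ^ bit(v, j)) for v in samples) / len(samples)


samples = [0b00, 0b01, 0b10, 0b11]
sc_and = spatial_correlation_and(samples, 0, 1)
sc_xnor = spatial_correlation_xnor(samples, 0, 1)
```

The xnor form also counts the word in which both bits are zero, which is why it captures matching bits rather than only bits that are both set.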
Routing power is also important [26]. It is shown that the routing power is
directly related to the power consumed by the logic and the input signal activity.
Inter-component routing power is shown to be small compared with intra-component
routing power if outputs are registered to reduce glitches.
Word-length optimisation is a major source of power reduction. Abdul Gaffar et
al. [1] present an approach built on BitSize [2] to reduce dynamic power by over 10%
using models which calculate the power required by every element in the circuit as
follows:
\[ P = \frac{1}{2} C V^2 \left[ \lim_{\mathit{cycles} \to \infty} \frac{n(\mathit{cycles})}{\mathit{cycles}} \right] \]
where C is the capacitance, V is the voltage, cycles is the number of clock cycles
and n is the number of signal transitions. Given that the capacitance is required, the
circuit must be placed and routed. The authors show that area-optimised circuits
will not always be optimal with regard to power. A modified cost function
produces an estimate of the logic and routing power for each component but not the
power used for routing between components. The Xilinx XPower power estimator
was chosen to evaluate the approach.
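The bracketed term in the per-element formula is the average number of signal transitions per clock cycle. A finite simulation trace can only approximate the limit, so the sketch below is an estimate under that assumption; it is not the model used in [1], and the names and values are hypothetical.

```python
def transition_count(trace):
    """Count bit flips between consecutive values of a signal trace."""
    return sum(bin(a ^ b).count("1") for a, b in zip(trace, trace[1:]))


def element_power(capacitance, voltage, trace):
    """Evaluate (1/2) * C * V^2 * n / cycles over a finite trace (an approximation of the limit)."""
    cycles = len(trace) - 1
    return 0.5 * capacitance * voltage ** 2 * transition_count(trace) / cycles


trace = [0b00, 0b11, 0b01, 0b01]   # 2 flips, then 1 flip, then none
flips = transition_count(trace)
power = element_power(capacitance=2.0, voltage=1.0, trace=trace)
```

Reducing the word-length of a signal removes bits from the trace and therefore lowers the transition count, which is how word-length optimisation reduces dynamic power in this model.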
Due to the inaccuracies inherent in power estimation tools, they are not used to
generate results in this thesis; however, the approach can easily be extended to do
so.
2.6 Summary
It has been shown that mixing number systems can reduce area (section 2.2.1). Much
work has been done to create efficient hardware designs from a high-level description
(section 2.4). There is, however, a gap between these two approaches. Number systems
are either manually mixed without meeting any accuracy requirement or hardware is
automatically generated with a single number system. Chapter 3 addresses this gap
by using integer linear programming (ILP) to generate hardware circuits with the
optimal mapping of numeric representation to operators while meeting the accuracy
of a software application.
In section 2.2.10 it is shown that integer linear programming can also be used to
calculate the optimal width of each operator. This approach may not always produce
the optimal solution after the circuit has been routed because the cost functions do
not take into account the placement of each operator and their timing constraints.
Coupled with a high execution time (hours for a small number of variables such as a
degree–8 polynomial) it is impractical when producing hardware designs from large
software applications. Generating sub-optimal solutions may also be time-consuming.
Simulated annealing can take several minutes for small designs (such as an 8×8
DCT). For this reason the design may need to be partitioned in order to reduce the
time taken for the word-length analysis (chapter 4). As shown in section 2.2.3 it may
be beneficial to combine data gathered at compile time and run time to produce a
fast analysis that removes the problem of overflow regardless of input data.
Power estimation (section 2.5) is now an important tool to optimise hardware
devices [1, 20, 22]. Low-level power analyses are time-consuming [110] so high-level
analyses [15, 26, 72] are often employed. Given that these models can be inaccurate,
power is measured directly in this thesis.
Many of the proposed hardware design methods are static — they do not adapt
to changing conditions. Work in the area of phase analysis shows that splitting a
program into phases based on run-time conditions can reduce circuit area and increase
the maximum clock frequency (section 2.3). If phase analysis and power analysis
were combined, energy could be reduced. In this case, it is important to select the
optimal reconfiguration strategy to change phases. One such method is clock gating
— switching off parts of a hardware circuit to reduce power. Although clock gating
in reconfigurable hardware is not as efficient as in application-specific integrated
circuits (ASICs) [133], it can be used to save power. Power saving techniques are
only effective when they can be applied for long periods of time. For this reason, the
energy requirements of a system are often investigated [103, 114].
A summary of previous work is shown in tables 2.3 and 2.4. In this thesis run-
time phase analysis is combined with word-length analysis (chapter 4) to create a
variable-precision design to reduce energy (chapter 5).
Approach                                      | Development Language | Range Analysis      | Precision Analysis          | Compile-time or Run-time
Affine arithmetic [45] (section 2.2.2)        | -                    | Affine arithmetic   | Affine arithmetic           | Compile-time
Bitwise [116] (section 2.2.4)                 | C                    | Interval arithmetic | None                        | Compile-time
FRIDGE [77] (section 2.2.5)                   | C with Annotations   | Interpolation       | Interpolation               | Compile-time and Run-time
BitSize [2] (section 2.2.6)                   | Modified C++         | Simulation          | Automatic differentiation   | Run-time
MATCH [60] (section 2.2.7)                    | MATLAB               | Interval arithmetic | Error propagation heuristic | Compile-time
Error heuristic [108] (section 2.2.8)         | MATLAB               | Interval arithmetic | Error propagation heuristic | Compile-time and Run-time
Cost heuristic [46] (section 2.2.9)           | Modified C++         | Simulation          | Simulation + heuristic      | Run-time
MiniBit [87] / PowerBit [1] (section 2.2.10)  | Modified C++         | Affine arithmetic   | Error propagation heuristic | Compile-time
Signal grouping [83] (section 2.2.11)         | -                    | Interval arithmetic | Simulation + heuristic      | Compile-time and Run-time
Right–Size [29] (section 2.2.12)              | Simulink             | Simulation          | Statistical analysis        | Run-time
Power [95] (section 2.2.12)                   | SystemC              | Simulation          | Simulation + heuristic      | Run-time
Leakage power [22] (section 2.2.13)           | C                    | Simulation          | Simulation + heuristic      | Compile-time and Run-time
Table 2.3: Comparison of the different approaches to word-length analysis. Compile-time
and run-time are used to clarify whether an approach makes use of data gathered at run time.
Approach                      | Phase Analysis               | Target                   | Analysis      | Static or Dynamic
Styles and Luk [119]          | Branch probabilities         | Area and clock frequency | Queuing model | Static
Bondalapati and Prasanna [17] | Variable precision           | Run time                 | Heuristic     | Static and Dynamic
Isci et al. [69]              | Counters and Block frequency | Power                    | Heuristic     | Static
Table 2.4: Comparison of the different approaches to phase analysis. The terms static and
dynamic are used to illustrate whether an approach can adapt to suit run-time conditions.
CHAPTER 3
Reducing Circuit Area using Multiple
Data Representations
Field-programmable gate arrays (FPGAs) and application-specific integrated circuits
support a wide variety of data formats, for example, floating-point and fixed-point.
Circuits typically contain floating-point units to closely match the functionality of a
software application or fixed-point units to reduce the area and increase the clock
frequency if possible. Using multiple data formats in the same circuit may require
additional logic to convert between them. It may be the case that representing the
same type of data with different data formats results in a more efficient circuit. The
optimal selection of numerical representation and mapping of dedicated hardware
blocks to each operator is the problem analysed in this chapter.
Calculating the optimal representation of each operation in a data-flow graph is
a complex problem. Additionally, there are multiple architectural choices for each
operation and format. As an example, consider embedded multiplier blocks. They
can be used in conjunction with other resources on an FPGA to realise fixed-point
and floating-point multiplication but not logarithmic addition. This significantly
increases the size of the search space.
A software application may contain both floating-point (single and double preci-
sion) and integer operations. A hardware circuit can make use of several different
number systems with different characteristics. The following must be considered
when determining the architecture:
• Number system: the representation of numbers, for example, floating-point,
  fixed-point, logarithmic, residue etc.
• Format and specification: the specific choice of number system, such as IEEE
  floating-point.
• Accuracy requirements: format conversion may introduce error which must
  be reduced by increasing the width of operators to match the error of the
  floating-point application.
• Components: embedded multipliers, lookup tables (LUTs) etc.
• Pipeline structure and latency: the more pipeline stages an operator has, the
  higher the latency, the more flip-flops it will require and the higher the clock
  frequency.
The novel aspects of the approach are:
1. The transformation of data-formats in a software application to enable it to be
efficiently transformed into a hardware design (section 3.2).
2. An integer linear programming (ILP) formulation to enable optimal solutions
to be found. This is compared with simulated annealing [79], an algorithm
that generates suboptimal solutions rapidly (section 3.3).
The benefits of the approach are shown with 8 benchmarks: a ray tracer, a B–splines
circuit, the GARCH(1,1) financial model, convolution, polynomial approximation,
complex multiplication, Gaussian blur and fast Fourier transform (section 3.4).
3.1 Problem Definition
The problem is constructed in the form of a directed graph, a data-flow graph
representing the software application in this case. Each node in the graph represents
an arithmetic operator, for example, multiplication, or a larger component to calculate
more complex functions, for example, a processor. Each node has a set of inputs
and outputs in a given representation. Every node must have an architecture which
consists of a numeric representation, for example, floating-point, and a set of resources,
such as embedded multipliers and LUTs. For a set of representations, $rep_1$, $rep_2$,
$rep_3$, ... and resources, $r_1$, $r_2$, $r_3$, ..., an operator can be constructed from any of the
valid combinations as described by the developer in an architectural description.
The data representation of all of the operators in the graph must remain consistent.
This means that if there is a true dependency from operator x to operator y in the
graph, denoted dep(x, y), the representation of the specified input to operator y must
be the same as the representation of the output of operator x, otherwise, converters
must be added. Converters may have an area and resource overhead.
In this problem, it is important to distinguish between resources to be minimised
and those which must be constrained. The goal is to minimise the area, for example,
LUTs (on an FPGA), given that there is a limited number of dedicated resources,
for example, embedded multipliers. If the approach is being applied to an ASIC
design, the unit of area will be transistors as opposed to LUTs. Given the function
$\mathit{units}_{op}(op, r)$, which specifies the number of units of resource $r$ required to
construct operator $op$, the following constraint must be met for a valid solution.
Table 3.1 shows a summary of the notation used to outline this problem.
Resource Constraint
The number of embedded resources in the system is defined as follows.
\[ \mathit{res}(r) = \sum_{op \in \mathit{ops}} \mathit{units}_{op}(op, r) \]
where $\mathit{ops}$ is the set of all operators. Since conversions between data representations
may make use of dedicated resources, the function $\mathit{units}_{cv}(x, y, r)$ is defined, which
given two operators and a resource, $x$, $y$ and $r$ respectively, returns the number of
units of that resource required for the conversion.
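\[ \mathit{res}_{cv}(r) = \sum_{x \in \mathit{ops}} \sum_{y \in \mathit{ops}} \begin{cases} \mathit{units}_{cv}(x, y, r) & \text{if } \mathit{dep}(x, y) \wedge \mathit{rep}_x \neq \mathit{rep}_y \\ 0 & \text{otherwise} \end{cases} \]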
Notation                        | Description
$r_i$                           | Resource i.
$\mathit{max}_r$                | The number of resource r available.
$\mathit{dedicated\_resources}$ | The set of all dedicated resources.
$\mathit{rep}_i$                | Representation of node i in the data-flow graph.
$\mathit{ops}$                  | The set of all operators.
$\mathit{dep}(x, y)$            | A true dependency from node x to node y.
$\mathit{units}_{op}(op, r)$    | The number of units of resource r required to construct operator op (which in the general case may be an adder, multiplier or larger component containing multiple arithmetic operations).
$\mathit{units}_{cv}(x, y, r)$  | The number of units of resource r required to convert the format of the output operand of operator x to the format of the input to operator y.
$\mathit{imp}_{i,j}$            | A binary variable representing operator i with architecture j.
$\mathit{imp}_{logic}$          | Architecture choices requiring no dedicated resources, for example, a fixed-point adder constructed out of LUTs.
$V_{logic}$                     | The set of all operators that cannot be constructed out of dedicated resources.
$\mathit{mcosts}_{i,j}$         | The number of embedded multipliers required to construct operator i with representation j.
$\mathit{lcosts}_{i,j}$         | The number of LUTs required to construct operator i with representation j.
Table 3.1: A summary of the notation used in the data representation problem.
If the output of a resource is used as the input to multiple resources with the same
representation, only one converter is required. For simplicity, the equation above
assumes that multiple converters are used, although this is not a restriction of the
model. The resource constraint is therefore:
\[ \mathit{res\_constraint} = \forall r \in \mathit{dedicated\_resources}: \; \mathit{res}(r) + \mathit{res}_{cv}(r) \le \mathit{max}_r \]
where $\mathit{max}_r$ is the total number of units of resource $r$ on the device.
Objective Function
The goal is to constrain the number of dedicated resources and minimise the area
(although power could also be minimised or clock frequency maximised in the general
case). The objective function contains two components: the operator costs and the
converter costs:
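\[ \begin{aligned} \text{minimise:} \quad & \mathit{res}(\mathit{LUTs}) + \mathit{res}_{cv}(\mathit{LUTs}) \\ \text{subject to:} \quad & \mathit{res\_constraint} \end{aligned} \]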
[Figure 3.1 flow diagram. Boxes: C/C++ Program; Input data; Range Library; Profiling
(Section 3.2); Architectural Description; ILP Solver; Simulated Annealing; Optimised
Hardware Design. The ILP Solver and Simulated Annealing boxes are grouped under
Optimisation (Section 3.3).]
Figure 3.1: An outline of the methodology to optimise data representation. The solution
is generated using either integer linear programming (section 3.3.1), in which a set of linear
constraints is constructed to represent the problem and then solved to find the optimal
solution, or simulated annealing (section 3.3.2), in which the solution is not guaranteed to be
optimal but is generated more rapidly in most cases. A hardware design containing all
relevant arithmetic units and routing is then constructed. An architectural description is
required to ensure that any solution is valid for a given device.
LUTs may be replaced with flip-flops in the objective function if clock frequency is being maximised.
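The resource constraint and objective function can be made concrete with an exhaustive search over a tiny data-flow graph. This is a sketch for illustration only: the thesis solves the problem with integer linear programming or simulated annealing, whereas the code below simply enumerates every combination. It assumes a single dedicated resource (embedded multipliers), a fixed LUT cost per converter, and converters that use no multipliers. All names and cost figures are hypothetical.

```python
from itertools import product


def optimise(choices, deps, conv_luts, max_mults):
    """Return (lut_cost, representation per operator) minimising LUTs.

    choices maps an operator to its candidate architectures, each a dict with
    keys "rep", "luts" and "mults". deps lists true dependencies (x, y).
    """
    names = list(choices)
    best = None
    for combo in product(*(choices[name] for name in names)):
        pick = dict(zip(names, combo))
        # Resource constraint: res(mults) must not exceed max_mults.
        if sum(arch["mults"] for arch in combo) > max_mults:
            continue
        # Objective: operator LUTs plus converter LUTs where representations differ.
        luts = sum(arch["luts"] for arch in combo)
        luts += sum(conv_luts for x, y in deps if pick[x]["rep"] != pick[y]["rep"])
        if best is None or luts < best[0]:
            best = (luts, {name: pick[name]["rep"] for name in names})
    return best


choices = {
    "mul": [
        {"rep": "fixed", "luts": 400, "mults": 0},   # LUT-only fixed-point multiplier
        {"rep": "fixed", "luts": 40, "mults": 4},    # uses embedded multipliers
        {"rep": "float", "luts": 250, "mults": 0},
    ],
    "add": [
        {"rep": "fixed", "luts": 30, "mults": 0},
        {"rep": "float", "luts": 250, "mults": 0},
    ],
}
deps = [("mul", "add")]
no_mults = optimise(choices, deps, conv_luts=100, max_mults=0)
with_mults = optimise(choices, deps, conv_luts=100, max_mults=4)
```

Without embedded multipliers the cheapest design mixes a floating-point multiplier with a fixed-point adder and pays for one converter; once four multipliers are available, a purely fixed-point design becomes cheaper. The search space grows exponentially with the number of operators, which is why enumeration is only viable for toy examples.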
3.2 Methodology
The methodology shown in figure 3.1 has three inputs, as well as the data-flow
graph to be optimised: the first input is simulation data, the second input is a
database of variable information, specifically relating to range, and the third input
is an architectural description of the device. The input software description contains
floating-point and integer types but does not store data with a fractional type
in integer variables. The output is a circuit design containing the optimal data
representation to minimise area (the model can also be extended to minimise power)
for a given device. The optimal representation could change if the device were to
change, hence the need for an architectural description.
Range library. The range library maintains information about every range being
analysed by the profiling stage. Some information about precision is also stored
because, as explained in section 3.2.1, range and precision are correlated.
Architectural description. An illustration of an architectural description is given
in appendix A. The description contains information about the device, for example,
the number and type of dedicated resources. This is utilised by the optimisation
function which searches for a solution; a solution is invalid if the design cannot fit
on the device. It is also important to provide parameters about the availability of
device-dependent arithmetic logic. One example of this is floating-point addition,
which can be mapped to Xilinx DSP48 embedded DSP blocks (figure 3.6) but not
Xilinx Mult18×18 embedded multipliers, whereas multiplication can be mapped
to both. A device tends to contain one or the other. Cost functions may also be
included in the description.
Cost functions enable the optimisation algorithms to calculate the area and
number of resources required for a given operator size. These functions can be as
complicated and computationally expensive as needed without having any impact
on the time taken to perform the analysis. This is due to each operator having its
accuracy calculated before the design is optimised (section 3.2.1).
There are two components to the methodology:
Profiling. A floating-point software application is profiled with ROSE [107] (sec-
tion 2.4). The portion of the application to be analysed is annotated with calls
to library functions providing information about each variable (explained in sec-
tion 3.2.1), such as the minimum and maximum range. A compile-time range analysis
will produce more conservative results but the accuracy constraint is guaranteed to
be satisfied. This is computationally expensive if tight bounds need to be obtained.
Given the range, the precision is calculated such that the accuracy is equivalent to
the floating-point application (section 3.2.1).
Optimisation. Two algorithms are compared: integer linear programming to
generate the optimal solution, described in section 3.3.1, and simulated annealing to
generate near-optimal solutions, described in section 3.3.2. Sub-optimal solutions are
generated because finding the optimal solution for large applications is not practical.
As shown in section 3.4, simulated annealing produces optimal solutions for the
applications tested.
Several features of applications and data formats are exploited to optimise the
designs, including the context of the operations within an application [52]:
• The output accuracy of the circuit must be guaranteed regardless of the
  representation of its operators (section 3.2.1).
• Fixed-point adders tend to be smaller than floating-point adders, whereas
  floating-point multipliers tend to be smaller than fixed-point multipliers.
• Floating-point operators have greater sensitivity around zero (section 3.2.2).
• Fixed-point operators can, in some cases, be more effectively mapped to
  dedicated resources (section 3.2.3).
[Figure 3.2 plots: (a) relative error and (b) absolute error against input value
(0.01 to 100, logarithmic axes) for the float and fixed formats.]
Figure 3.2: Relative and absolute accuracy of floating-point and fixed-point compared
using a 32-bit fixed-point format (8-bit range and 24-bit precision) and single precision
floating-point. Floating-point has a constant relative error because its precision increases
as its range decreases. This is contrasted with fixed-point which has a constant absolute
error because the number of fractional bits is constant as its range decreases.
3.2.1 Accuracy
Number systems are often mixed to take advantage of the different aspects of each
one [8, 62, 131].
Floating-point:
• Large dynamic range.
• Accurate multiplication of values close to zero.
Fixed-point:
• Worst-case absolute accuracy remains constant regardless of range (provided
  that the width of the non-fractional component is large enough).
• Small operators, for example, addition.
The disadvantage with these approaches is that they do not guarantee accuracy.
Simulations must be performed to calculate the degree to which operator widths
[Figure 3.3 diagram: a 32/64-bit floating-point word (sign S, range, precision) shown
against a fixed-point word (range, precision), with widths labelled x and y.]
Figure 3.3: A comparison of floating-point and fixed-point accuracy. Floating-point
variables are stored in a normalised form: $(-1)^s \times 1.m \times 2^e$ where s is the sign bit, e the exponent
(ignoring the bias) and m the mantissa. The correlation between range and precision is
illustrated: as the range of the floating-point variable increases, its precision decreases.
should be reduced. Figure 3.2 shows how the relative and absolute error of floating-
point variables changes compared to fixed-point variables. The unsigned fixed-point
variables have an 8-bit range, 24-bit precision and support round-to-nearest; single
precision floating-point is assumed (8-bit exponent; 23-bit mantissa). Floating-point
maintains a constant relative accuracy because as the range increases, the precision
decreases. If the range is increased above 2^8 (for unsigned variables), the error
on the fixed-point variable increases rapidly because the variable overflows. The
degree to which the error increases depends on the method of overflow prevention,
for example, saturation arithmetic [35] or overflow flags. The worst case absolute
error of fixed-point variables remains constant because the precision width remains
constant regardless of range (provided that the number of bits to store the integer
part is large enough). A key feature of this methodology is that the accuracy of
the floating-point software application is maintained. This means that operators in
other representations, for example fixed-point, must be sized so that their accuracy
equals that of the floating-point original. Given the range of a floating-point variable, the precision can
be calculated because the size is known (single precision: 32 bits or double precision:
64 bits). This is shown in figure 3.3. The size of y is calculated based on the size
of x, determined by profiling. The range of the fixed-point variable may be smaller
or larger than the size of the floating-point exponent. The following information is
collected for each variable.
1. The maximum and minimum values.
2. The sign; if a variable is allowed to store negative values, the range may need
to be increased by 1 bit.
3.2 Methodology  43
3. The closest value to zero (both positive and negative).
4. The greatest precision a variable requires; a variable may not require an infinite
precision to accurately represent the values it can take.
Phase information [119] is added to this list if energy is to be minimised. It must be
determined whether reconfiguring the device will be more energy-efficient, given the
large power overhead that will be incurred for a short time.
Items 1 and 2 are used to calculate the range of a variable; items 3 and 4 are used
to calculate the required precision. Floating-point variables have a greater sensitivity
around zero because they are normalised to the following form: (−1)^s × 1.m × 2^e, where
s is the sign bit, e the exponent and m the mantissa; the bias is ignored in this
case. The closest value to zero must therefore be determined in order to calculate
the size of the fixed-point operators; performed as in [50]. The precision required
for a fixed-point variable is calculated by adding the precision of the variable to
the number of zeros immediately following the binary point. The precision of the
variable is 24 bits for single precision floating-point unless the variable is guaranteed
to end in zeros, for example, if a known constant is assigned to the variable. In this
case, the precision can be reduced. It is sometimes possible to reduce the precision if
the minimum value a variable can take is greater than zero.
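The width calculation described above can be sketched in Python. This is a minimal illustration rather than the tool used in this work: the function name fixed_point_widths and its arguments are hypothetical, single precision (a 24-bit significand) is assumed, and the profile is assumed to contain at least one non-zero sample. The result is conservative because it does not attempt the precision reductions mentioned above.

```python
import math

def fixed_point_widths(samples, significand_bits=24):
    """Return (integer_bits, fraction_bits) for a profiled variable.

    Hypothetical helper: integer_bits covers the largest magnitude seen
    (plus one bit if negative values occur); fraction_bits is the
    significand width plus the zeros that follow the binary point in
    the value closest to zero.
    """
    magnitudes = [abs(v) for v in samples if v != 0.0]
    largest, smallest = max(magnitudes), min(magnitudes)

    # frexp returns (m, e) with v = m * 2**e and 0.5 <= m < 1, so v < 2**e.
    integer_bits = max(0, math.frexp(largest)[1])
    if any(v < 0 for v in samples):
        integer_bits += 1  # extra range bit for signed variables

    # Number of zeros immediately after the binary point of the smallest magnitude.
    leading_zeros = max(0, -math.frexp(smallest)[1])
    return integer_bits, significand_bits + leading_zeros

print(fixed_point_widths([0.5, 3.0, -2.0 ** -10]))
```

For a smallest magnitude of 2^(−10), the sketch returns 33 fractional bits, which agrees with the 23 + x rule for single precision.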
The advantage of this approach compared with accuracy-guaranteed [87] approaches
is that error does not need to be propagated. This means that the simulated
annealing algorithm is more efficient because the cost only needs to be calculated
once for each node and type. The disadvantage in some cases is that the accuracy
can be too high. The accuracy must be guaranteed for every input in the simulation:
a fixed-point representation must produce the same result as the equivalent
floating-point function. Assuming that the smallest value input is 2^(−x), a fixed-point
variable has to be at least 23 + x bits wide (assuming single precision floating-point).
Not all of the bits may be used at a given instant; bits will therefore be wasted.
Reconfiguration is employed to reduce area further if worst-case values are rare.
Chapter 5 describes a method of determining how rare a situation has to be before
reconfiguring the entire chip becomes more efficient.
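The contrast drawn in figure 3.2 between constant relative error (floating-point) and constant absolute error (fixed-point) can be checked numerically. The following sketch is illustrative only: the helper names are hypothetical, single precision is emulated by a round trip through struct, and the fixed-point format is assumed to have 24 fractional bits with round-to-nearest and no overflow handling.

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def to_fixed(value, fraction_bits=24):
    """Round to the nearest multiple of 2**-fraction_bits (no overflow check)."""
    scale = 1 << fraction_bits
    return round(value * scale) / scale

for value in (0.001, 0.1, 1.7, 100.3):
    abs_float = abs(to_float32(value) - value)
    abs_fixed = abs(to_fixed(value) - value)
    print(f"{value:8}: float32 abs={abs_float:.2e} rel={abs_float / value:.2e}  "
          f"fixed abs={abs_fixed:.2e} rel={abs_fixed / value:.2e}")
```

Under these assumptions the fixed-point absolute error never exceeds 2^(−25), whereas the floating-point relative error never exceeds 2^(−24), regardless of the magnitude of the value.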
3.2.2 Floating-Point Hardware
Adders
Figure 3.4(a) shows the structure of a floating-point adder. Before floating-point
numbers can be added, the binary point of one number may need to be moved such
[Figure 3.4: block diagrams with exponent and mantissa paths. (a) adder: compare and select, align (barrel shifter), add, normalise, round. (b) multiplier: add + bias, multiply, normalise, round.]
Figure 3.4: Simplified structure of a floating-point adder and multiplier. A floating-
point multiplier is similar to a fixed-point multiplier, but it contains logic to normalise
the variables and an adder to calculate the exponent. Floating-point addition is more
complex, requiring a costly barrel shifter to align the variables.
that the binary point of both numbers is in the same position. This requires a
barrel shifter — a shifter, which given as input the number of bits to shift, shifts
the value by the required number of bits. A barrel shifter is a large component
requiring approximately n log2(n) multiplexers, where n is the width of the shifter.
Comparators are combined with the shifter to improve efficiency.
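The n log2(n) multiplexer estimate follows from the usual logarithmic barrel shifter structure, sketched below as a behavioural Python model. The function name and the multiplexer counter are illustrative assumptions; the model assumes the width is a power of two and does not describe the adder core itself.

```python
def barrel_shift_right(value, shift, width=32):
    """Logarithmic barrel shifter model; width must be a power of two.

    Stage k either passes its input through or shifts it by 2**k,
    selected by bit k of the shift amount. Each stage needs one 2:1
    multiplexer per bit, giving width * log2(width) multiplexers.
    """
    stages = width.bit_length() - 1          # log2(width)
    mask = (1 << width) - 1
    value &= mask
    multiplexers = 0
    for k in range(stages):
        if (shift >> k) & 1:
            value >>= 1 << k                 # this stage shifts by 2**k
        multiplexers += width                # one 2:1 multiplexer per bit
    return value, multiplexers

print(barrel_shift_right(0b11110000, 4, width=8))
```

For an 8-bit shifter the model uses three stages and therefore 24 multiplexers.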
Logic is required to find the leading 1 in the result. Once found, the result can
be shifted until the 1 is one bit to the left of the most significant bit (since the first 1
is not stored in a normalised floating-point variable). Fixed-point addition requires
none of this overhead because it is not normalised. It can therefore be significantly
smaller.
A floating-point adder uses 573 LUTs on the Xilinx Virtex 4 LX200, synthesised
with Coregen 10.1. The size of a fixed-point adder realised on the same device can
be estimated by taking the maximum of the two input widths. For the adders to
perform operations of equal accuracy, the input range must be known due to the
accuracy loss caused by aligning the two values. For this reason, an exact value is
not specified for the size of the fixed-point adder.
Multipliers
Figure 3.4(b) shows the structure of a floating-point multiplier. Floating-point
multipliers are larger than fixed-point multipliers of the same width because the
variables must be normalised and the exponents added. Floating-point multipliers
become more efficient as the values input approach zero because the leading zeros do
[Figure 3.5: plot of Area [LUTs] against Dedicated Functional Blocks; series: ilp-xc2vp30, ilp-xc3s500e, ilp-xc4vlx200.]
Figure 3.5: Area (LUTs) of the B–splines design with multiple number systems and a varying
number of embedded multipliers on the Virtex 4 XC4VLX200, Spartan 3 XC3S500E
and Virtex II XC2VP30. The Virtex 4 design requires fewer LUTs if embedded DSP
blocks are available (compared with the Virtex II and Spartan designs) because Virtex 4
devices contain a 48-bit adder; the embedded multipliers on the Virtex II and Spartan
devices do not. This enables the embedded multipliers to be connected together with a
reduced amount of additional logic.
not have to be stored due to the exponent. Floating-point multipliers do not require
barrel shifters or comparators and are therefore generally more efficient than their
fixed-point equivalent (the exponent addition requiring few LUTs).
A floating-point multiplier uses 631 LUTs on the Xilinx Virtex 4 LX200, synthesised
with Coregen 10.1. The size of a fixed-point multiplier realised on the same
device can be estimated by multiplying the two input sizes (this can be reduced
in some cases depending on the architecture and multiplication algorithm). As
explained above, for the multipliers to perform operations of equal accuracy, the
input ranges must be known. For this reason, an exact value is not specified for the
size of the fixed-point multiplier.
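The first-order estimates given in this section can be written down directly. In the sketch below the floating-point figures are the Coregen results quoted above; the function names are hypothetical and the fixed-point estimates are only the simple rules stated in the text (maximum input width for an adder, product of the input widths for a multiplier), not synthesis results.

```python
FLOAT_ADDER_LUTS = 573       # single precision, Virtex 4 LX200 (quoted in the text)
FLOAT_MULTIPLIER_LUTS = 631  # single precision, Virtex 4 LX200 (quoted in the text)

def fixed_adder_luts(width_a, width_b):
    """First-order estimate: the maximum of the two input widths."""
    return max(width_a, width_b)

def fixed_multiplier_luts(width_a, width_b):
    """First-order estimate: the product of the two input widths."""
    return width_a * width_b

for width in (16, 24, 34, 40):
    print(width,
          fixed_adder_luts(width, width), FLOAT_ADDER_LUTS,
          fixed_multiplier_luts(width, width), FLOAT_MULTIPLIER_LUTS)
```

Under these rules a fixed-point adder remains far smaller than the floating-point adder at every width shown, whereas a fixed-point multiplier overtakes the floating-point multiplier once the operands are slightly wider than 24 bits.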
3.2.3 Embedded Hardware
On current devices, fixed-point units tend to utilise dedicated resources more effectively
than floating-point units due to the additional components required to
perform floating-point arithmetic. The floating-point adder, for example, contains
shift operations that cannot be mapped to dedicated carry chains. Floating-point
adders can be mapped to dedicated multiply-add blocks but may underutilise the
resource. The device chosen will therefore depend on the format of the arithmetic.
Estimating the resources required by a circuit with a high-level model has the
advantage of being architecture-independent. Including some architectural characteristics
in the model, however, can improve the accuracy. Figure 3.6 shows the
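[Figure 3.6: diagram of a DSP48: registered (D Q) 18-bit inputs, a multiplier (×) with a 36-bit product and an adder (+) with a 48-bit result.]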
Figure 3.6: Simplified diagram of a DSP48 on Virtex 4 FPGAs. The additional adder is
a key feature in reducing logic, enabling multiply-add functions commonly used in digital
signal processing applications to be realised efficiently.
architecture of a DSP48, a component on many Virtex 4 FPGAs containing addition
logic that may be used to connect small multipliers together to form larger ones,
as well as being used for other purposes. This resource can act as a multiplier,
adder, multiply-accumulate and shifter commonly found in fixed and floating-point
arithmetic. This can be contrasted with the 18-bit multipliers found on Virtex II
devices which contain no additional adder. Figure 3.5 illustrates the effect this has.
A uniform cubic B–splines [71] circuit, applied in some image warping
applications, is often smaller when constructed out of DSP48 embedded blocks than
when constructed out of embedded multipliers. The difference in area is a result of large multipliers
requiring several dedicated multipliers to be connected together. This extra logic can
either be mapped to the dedicated resource, or lookup tables (LUTs). Due to large
differences between the embedded devices, the cost models are architecture-specific
to a degree.
The results show that circuits combining different numerical representations
utilise the available components on the embedded devices more fully. This approach
can be used to select a device based on the number and type of embedded resource,
lowering the cost of a product. It can also be used to design new devices with a
greater performance, extending new architectures that focus on a single number
system [63]. Deciding which dedicated blocks to add to an FPGA is not easy. There
is a trade-off between the functionality of a device and its cost. The more generic the
resources are, the lower the cost but the less efficient the FPGA is at solving domain-
specific problems. Ho et al. [64] have designed FPGAs specifically for floating-point
applications by replacing dedicated fixed-point multiply-add logic, commonly found
in current FPGAs, with floating-point logic. Adder blocks are placed after the
output of the multiplier blocks to reduce routing for circuits containing multiply-add
functions. Area improvements of 18 times and delay reductions of 2.5 times are
shown using the proposed hybrid FPGA; this is extended in section 3.5.
Floating-point addition can be mapped to dedicated resources. A single-precision
floating-point addition realised on a Xilinx Virtex 4 LX200 requires 573 LUTs; the
use of DSP48 embedded DSP blocks, which contain an 18×18-bit multiplier and
a 48-bit adder, results in an area utilisation of 342 LUTs and 4 DSP48 blocks. If
a floating-point addition is the only function using a set of DSP48 blocks, only
a fraction of the hardware is utilised. Section 3.4 shows that, in some cases, the
circuit with the fewest LUTs may not be the optimal solution with regard to raw hardware
area because the number of embedded resources used to achieve this reduction is so
high.
3.2.4 Number Systems
The selection of number system depends on the operators in the data-flow graph and
how they are clustered. It has been shown that an N–Body problem can be solved
efficiently with a logarithmic number system [120]. If a large number of multipliers
are grouped together, a logarithmic number system can be adopted to reduce area.
The logarithmic number system has the following properties, given that a = log2(A)
and b = log2(B):
\[ A \times B = 2^{a+b} \]
\[ A + B = 2^{\,b + \log_2(2^{a-b} + 1)} \]
\[ \sqrt{A} = 2^{a/2} \]
Multiplication and square root are low cost functions (square root being replaced by
a 1 bit shift) whereas addition is more complex. To use a logarithmic number system
effectively, clusters of multipliers have to be located. Given that logarithmic addition
and subtraction cannot be mapped as effectively to dedicated resources, it is less
common on current FPGAs. Consider part of a ray tracer (figure B.1). Although
the square root will be more efficient, it is connected to several addition operations
that negate any area saving.
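The three properties above can be verified with a short floating-point model of the logarithmic number system. This sketch is purely illustrative: the function names are hypothetical, only positive values are handled, and a hardware realisation would approximate the log2(2^(a−b) + 1) term rather than evaluate it exactly.

```python
import math

def lns_encode(value):
    """Represent a positive value by its base-2 logarithm."""
    return math.log2(value)

def lns_decode(a):
    return 2.0 ** a

def lns_multiply(a, b):
    """A x B becomes a single addition of the logarithms."""
    return a + b

def lns_add(a, b):
    """A + B = 2**(b + log2(2**(a - b) + 1)); the costly operation."""
    return b + math.log2(2.0 ** (a - b) + 1.0)

def lns_sqrt(a):
    """Square root halves the logarithm (a 1-bit shift in hardware)."""
    return a / 2.0

a, b = lns_encode(6.0), lns_encode(2.5)
print(lns_decode(lns_multiply(a, b)), lns_decode(lns_add(a, b)),
      lns_decode(lns_sqrt(lns_encode(9.0))))
```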
Another number system that is often used on FPGAs is the residue number
system which breaks large values up into a set of smaller ones, sometimes making
a function more area-efficient. Operations are performed with modulo arithmetic
which means that a set of moduli, m, is required (each value in set m must be selected
carefully to ensure that the values represented are unique). To represent variable v, another
set, x, is created:
\[ x_i = v \bmod m_i \]
Comparison is often more complicated than it is in the other number systems
discussed. Operations are performed as follows.
\[ A \times B = (a_i \times b_i) \bmod m_i \]
\[ A + B = (a_i + b_i) \bmod m_i \]
where A and B are represented as sets (as above) and ai is the element of A at
position i. Selection of the moduli set is important, affecting the size of the operation
and conversion. Scaling is also important to keep the binary point
in the correct position. It can be so expensive that variables may be converted
to a different representation and then shifted [48]. Although some operators are
much larger, it is shown that multipliers greater than 10 bits are smaller; below this
width the overhead makes the operation less efficient. Floating-point and fixed-point
are selected in preference to such numerical systems because current devices have
better support for them, containing dedicated resources for the functions requiring
the greatest area. As well as this, inputs and outputs of a circuit commonly have
floating-point or fixed-point representations depending on the type of values being
handled.
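The residue number system operations described above can be illustrated with a small integer model. The moduli set (3, 5, 7) and the function names are assumptions chosen for the example; the conversion back to a conventional integer uses the standard Chinese remainder theorem and is not intended to describe any particular hardware converter.

```python
from math import prod

MODULI = (3, 5, 7)  # pairwise coprime, so 0..104 are represented uniquely

def to_rns(value, moduli=MODULI):
    """x_i = v mod m_i for every modulus."""
    return tuple(value % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))

def rns_multiply(a, b, moduli=MODULI):
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """Chinese remainder theorem reconstruction (requires Python 3.8+)."""
    total = prod(moduli)
    value = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        value += r * partial * pow(partial, -1, m)  # modular inverse of partial
    return value % total

print(to_rns(23), from_rns(rns_multiply(to_rns(9), to_rns(11))))
```

Each residue channel operates on a few bits only and is independent of the others, which is the source of the area saving for wide multipliers; the result is valid only while it stays below the product of the moduli.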
To extend this methodology, a cost function and a set of constraints are required
for each operator. In the case of a logarithmic multiply/divide, the cost function is simply
the fixed-point addition/subtraction cost function. In other cases, function
approximation [88] can be used to estimate the cost. Consider logarithmic addition
and subtraction [49]; the cost may be estimated by breaking log2(2^x ± 1) into
components. Extending the approach to support a new device requires a set of device
constraints; an example is shown in appendix A. The constraints specify:
  • The type of each embedded resource on the device.
  • The number of each resource, if limited.
Additional restrictions could be added if required, for example, a utilisation constraint.
In section 3.4 it is shown that reducing the number of LUTs by a small amount can
sometimes require a large number of embedded DSP blocks. This may be undesirable
because these DSP blocks may be more fully utilised in a different part of the design
or a different device with fewer DSP blocks could be chosen to reduce the production
cost.
Although ASICs do not contain embedded resources, an architectural description
could still be required. A chip may contain processing elements that are available
to more than one hardware component. Such sharing will require the addition of a
scheduler [37] and require the input data to be converted into a specified format, as
is the case for embedded FPGA resources.
3.3 Optimisation
It is important to distinguish between optimal and near-optimal solutions to the
problem described in section 3.1. Near-optimal solutions are generated because as
the size of the problem increases, it becomes impractical to find the optimal solution.
In practice, only one of the following algorithms will be chosen.
3.3.1 Integer Linear Programming
The optimal solution is found using integer linear programming (ILP). In order to
solve a problem in this way it must first be transformed into a set of linear equations;
this section describes the approach. Every node in the data-flow graph must have
an architecture (as described in section 3.1): the Cartesian product of resources
{LUTs, embedded multipliers} and representations {fixed-point, floating-point}.
Although it may be possible to realise part of an operator in LUTs and part in an
embedded resource, this is not done here: the number of dedicated resources required
to fully construct the operator is used. For example, a 32-bit multiplier may be
constructed with one 18-bit multiplier and a large number of LUTs. This is not done
because some tools do not support the mapping of an arbitrary number of embedded
resources to an operator of arbitrary size: 4 embedded multipliers are used; any
remaining functionality uses LUTs. The model could be extended to cater for this
by adding more numerical representations. The problem would become significantly
more complex. It would take longer to solve and only improve results if there were
limited resources available because they would be more fully utilised.
The Cartesian product of resources and representations gives the set of architecture
choices. Each node must have a single architecture, therefore the following constraint
is added.
\[ \forall i: \quad \sum_{j=0}^{t-1} \mathit{imp}_{i,j} = 1 \qquad (3.1) \]
[Figure 3.7: plot of Time [s] against Dedicated Functional Blocks; series: ilp-adddsp, ilp-addlut.]
Figure 3.7: Increasing the number of constraints on the ILP model can reduce the run
time because the problem is simplified. If floating-point addition is allowed to be mapped
to dedicated resources, the algorithm runs up to 4 times slower for a B–splines design.
where imp_{i,j} is a binary variable set to 1 if node i has architecture j and t is the
number of possible architectures — 4 in the experiments (fixed-point using LUTs,
floating-point using embedded multipliers etc.). If it could be guaranteed that
dedicated resources would be best mapped to specific operators, such as costly
multipliers, the constraint could be modified by fixing or limiting the type of a node.
This may occur if there are limited resources available.
Not all nodes can use dedicated resources (they cannot be given an architecture
requiring dedicated resources), for example, square root. Two sets are defined.
The first is a set of architecture choices, imp_logic, that do not require dedicated
resources, for example, a fixed-point operator constructed from LUTs. The second
set, V_logic, contains operators that cannot be constructed with dedicated resources.
The following constraint is added:
\[ \forall i,\ \big((\mathit{operator}_i \in V_{logic}) \wedge (x \notin \mathit{imp}_{logic})\big): \quad \mathit{imp}_{i,x} = 0 \qquad (3.2) \]
This constraint is added if the dedicated resources on a device cannot be used to
reduce the area of a floating-point adder. As shown in figure 3.7, adding such a
constraint can reduce the run time because the problem is simplified. A similar
constraint is also added if a core has not been developed for a given number system
or device. The number of constraints could be reduced: if equation 3.2 were removed,
the type constraint (equation 3.1) and objective function (equation 3.4) would be
modified to incorporate the same restriction. For clarity, all constraints are included.
The number of embedded multipliers is often limited. If this is not the case, the
following constraint is not added. An array, mcosts, is constructed; each element
specifies the number of embedded multipliers required for a given node, with a given
representation.
\[ \sum_{i=0}^{n-1} \sum_{j \notin \mathit{imp}_{logic}} (\mathit{imp}_{i,j} \times \mathit{mcosts}_{i,j}) \leq \mathit{maxmults} \qquad (3.3) \]
where n is the number of operators in the data-flow graph and maxmults is the
number of dedicated multipliers available.
The objective function is composed of two parts. The first part relates to the
number of resources utilised for each node.
\[ \sum_{i=0}^{n-1} \sum_{j=0}^{t-1} (\mathit{imp}_{i,j} \times \mathit{lcosts}_{i,j}) \qquad (3.4) \]
where lcosts_{i,j} gives the cost in LUTs of node i with representation j (the number
of flip-flops could be used if optimising clock frequency as opposed to area). Even
if an operator is constructed from embedded multipliers, it may still use LUTs as
well, so every architecture is included in the summation. The number of flip-flops
(calculated with another cost function) is also analysed because the structure of the
pipeline has more of an impact on the number of flip-flops than LUTs.
The second part of the objective function relates to the conversion costs. If a
multiplier has a fixed-point input and a floating-point input, one of the inputs must
have its representation converted.
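\[
\sum_{\forall i \forall j}
\begin{cases}
\begin{aligned}
& (\mathit{imp}_{i,fixed} \times \mathit{imp}_{j,float} \times \mathit{fixed\_to\_float}(i,j)) \; + \\
& (\mathit{imp}_{i,fixed} \times \mathit{imp}_{j,float\_mult} \times \mathit{fixed\_to\_float}(i,j)) \; + \\
& \ldots \\
& (\mathit{imp}_{i,float\_mult} \times \mathit{imp}_{j,fixed} \times \mathit{float\_to\_fixed}(i,j)) \; + \\
& \ldots
\end{aligned}
& \text{if } \mathit{dep}(i,j) \\
0 & \text{otherwise}
\end{cases}
\qquad (3.5)
\]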
where dep(i, j) shows that there exists a true dependency from node i to node j;
fixed and float represent fixed-point and floating-point representations using LUTs
and float_mult represents a floating-point representation using dedicated multipliers;
fixed_to_float(i, j) represents the number of LUTs required to construct a fixed-point
to floating-point converter. No conversion cost is added if the nodes have the
same representation. A similar constraint can be added to ensure that converters
are only added once in the case that an operator has multiple outputs with the same
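[Figure 3.8: data-flow graph with two multipliers (operators 1 and 2) on inputs c[0], c[1], in[0] and in[1], feeding an adder (operator 3); edge widths are labelled 34 and 40 bits.]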
Figure 3.8: An example data-flow graph to illustrate the data representation problem as a
set of linear constraints. The width of inputs and intermediate variables may be different
in each number system. In this example, it is assumed that the input data-flow graph
uses a single representation, single precision floating-point. The width of each variable in
different number systems depends on the input data supplied — 34 and 40 bits in fixed-
point in this case. A set of constraints is constructed and then solved to find the optimal
number system and set of dedicated resources for each node.
Representation               operator1   operator2   operator3
fixed                        0           0           1
float                        0           1           0
fixed (embedded resources)   0           0           x
float (embedded resources)   1           0           x
Table 3.2: One possible solution to the example given. An x represents an invalid architecture;
in this case, addition cannot be mapped to embedded multipliers.
representation. The two components of the objective function (equations 3.4 and 3.5)
are added and minimised to find the optimal solution, subject to the constraints
(equations 3.1, 3.2 and 3.3).
Example
Figure 3.8 shows part of a dot product which will be used to illustrate the equations
outlined above. For this example, it is assumed that the circuit is realised on
a Virtex II device. As explained in section 3.2.3, the architecture of the device
affects the solution generated — a Virtex II device contains embedded multipliers
with no additional adder logic. Constraints are added to restrict the search space
(equation 3.2):
\[ \mathit{imp}_{operator3,fixed\_mult} + \mathit{imp}_{operator3,float\_mult} = 0 \qquad (3.6) \]
indicated by an x in the example solution shown in table 3.2. The two components
of equation 3.6 are indices into table 3.2. Pointwise multiplication and summation
are then used to calculate the cost of the operators, as in equation 3.4.
\[ \sum_{i=0}^{2} \sum_{j=0}^{3} (\mathit{imp}_{i,j} \times \mathit{lcosts}_{i,j}) \]
Cost functions are discussed briefly in appendix A. The resource constraint is
calculated in a similar way. Given that the widths of the multipliers are 34 bits, 4
embedded multipliers will be required whether using floating-point or fixed-point.
Although the multipliers are 18 bits wide, the sign bit is wasted in every multiplier input except the one holding the most significant
part of each variable when multipliers are chained together. It may therefore be
beneficial to provide more unsigned multipliers than signed if large multiplication is
frequently required. If the device contained fewer than 4 embedded multipliers, the
solution shown in table 3.2 would be invalid assuming the same architecture as a
Virtex II device.
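The figure of 4 embedded multipliers for a 34-bit multiplication can be reproduced with a simple count. The sketch below rests on an assumption not stated as a formula in the text: every operand is split into chunks of 17 bits (the 18-bit input less its sign bit) and one block is needed per pair of chunks. The function name is hypothetical and the estimate is slightly conservative for signed operands, whose most significant chunk can use all 18 bits.

```python
import math

def embedded_multipliers(width_a, width_b, block_bits=18):
    """Estimate the number of signed block multipliers for a wide product.

    Assumption: each chunk carries block_bits - 1 data bits because the
    sign bit of a block input is wasted when multipliers are chained.
    """
    usable = block_bits - 1
    return math.ceil(width_a / usable) * math.ceil(width_b / usable)

print(embedded_multipliers(34, 34))
```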
So far, additional hardware to convert between data formats has not been considered.
From the data-flow graph, it can be seen that conversion cores may need
to be added. Any one of the true dependencies shown in figure 3.8 could result in
additional logic being synthesised. The size of such hardware and whether it is added
depends on the representation and width of each operand: the larger the bus width,
the more costly a conversion core will be. For the solution illustrated (equation 3.5):
\[ \mathit{float\_to\_fixed}(operator1, operator3) + \mathit{float\_to\_fixed}(operator2, operator3) \]
where float_to_fixed(operator1, operator3) and float_to_fixed(operator2, operator3)
are the area costs of floating-point to fixed-point conversion cores with a width of 40
bits (figure 3.8). If the graph illustrated were part of a large data-flow graph, the
solution would not necessarily be optimal. Consider a circuit in which the output
has to be converted back to floating-point. A floating-point addition is significantly
smaller than three 40-bit conversion cores and a fixed-point addition (561 LUTs
compared to 927 LUTs).
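The structure of the model for this three-operator example can be sketched as follows. The sketch substitutes exhaustive enumeration for an ILP solver, which is practical only because the example is tiny, and most of the costs are hypothetical: the floating-point LUT counts are those quoted in section 3.2.2, the fixed-point LUT counts follow the first-order width rules, and the embedded-resource LUT costs and the 120-LUT converter cost are invented for illustration. All identifiers are assumptions.

```python
from itertools import product

ARCHS = ["fixed", "float", "fixed_mult", "float_mult"]
NODES = ["op1", "op2", "op3"]                 # two multipliers feeding one adder
DEPS = [("op1", "op3"), ("op2", "op3")]       # true dependencies

MULTIPLIER = {"fixed": 1156, "float": 631, "fixed_mult": 100, "float_mult": 200}
ADDER = {"fixed": 40, "float": 573, "fixed_mult": None, "float_mult": None}
LCOSTS = {"op1": MULTIPLIER, "op2": MULTIPLIER, "op3": ADDER}  # None: invalid
MCOSTS = {"fixed": 0, "float": 0, "fixed_mult": 4, "float_mult": 4}
CONVERTER_LUTS = 120                          # hypothetical conversion core cost

def representation(arch):
    return "fixed" if arch.startswith("fixed") else "float"

def solve(max_mults):
    """Return (cost, assignment) minimising LUTs; brute force, not ILP."""
    best = None
    for choice in product(ARCHS, repeat=len(NODES)):   # one architecture per node (3.1)
        assign = dict(zip(NODES, choice))
        if any(LCOSTS[n][a] is None for n, a in assign.items()):
            continue                                   # invalid architecture (3.2)
        if sum(MCOSTS[a] for a in choice) > max_mults:
            continue                                   # resource limit (3.3)
        cost = sum(LCOSTS[n][a] for n, a in assign.items())           # operators (3.4)
        cost += sum(CONVERTER_LUTS for src, dst in DEPS               # converters (3.5)
                    if representation(assign[src]) != representation(assign[dst]))
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best

for limit in (0, 4, 8):
    print(limit, *solve(limit))
```

With these invented costs and no embedded multipliers available, the cheapest assignment mixes floating-point multipliers with a fixed-point adder despite paying for two converters, which is the kind of trade-off the model is intended to expose.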
Shared Representation
Load and store operations used on the same variable must have the same representation
because a variable written to memory must be read back in the same format. It
may be possible to store data of different representations in the same memory but
additional information would have to be stored to distinguish each one. As well as
load and store operations, input and output operators are also included to indicate
variables that have a fixed format; the problem is therefore solved more quickly. The
following constraint is added to the ILP model:
\[ \forall i: \quad \mathit{imp}_{x,i} = \mathit{imp}_{y,i} \]
where x and y are the respective node positions.
To illustrate this constraint, consider the GARCH (generalised autoregressive
conditional heteroskedasticity) financial model [38], one of the benchmarks discussed
in section 3.4. The first thing to note is that the benchmark uses a random number
generator. Only floating-point cores are available so this node in the data-flow graph
has a fixed architecture (the core cannot be mapped to dedicated resources). Coupled
with this, the inputs and outputs have the same type.
\[ \forall i: \quad \mathit{imp}_{node_{\sigma},\,i} = \mathit{imp}_{node_{\sigma'},\,i} \]
where node_x is the index of node x in the table of constraints and σ is one of the
parameters.
Shared Resources and Bandwidth
Some resources may be shared to reduce area. A design containing shared resources
may produce results at a slower rate but will be smaller. Shared resources must
have the same architecture and must be the same size, the maximum size of the
two operations (which may result in a greater accuracy than required). A shared
architecture constraint is added (section 3.3.1). When calculating the cost of the
design, shared nodes are only added once. Additional routing logic may be required
but for simplicity, it is not included here. The equation being minimised is altered
by summing over a subset of the nodes. Given two shared nodes, s_i and s_j:
\[ \sum_{\substack{i=0 \\ i \neq s_i}}^{n-1} \sum_{j=0}^{t-1} (\mathit{imp}_{i,j} \times \mathit{lcosts}_{i,j}) \]
A similar constraint is added when calculating the number of embedded resources;
they are only added once. The ability to include shared resources in a hardware
description allows automatic pipeline generation [37] to reduce the area of a circuit
when resources are limited. This is done by selecting operators to share while meeting
strict latency constraints.
[Figure 3.9: diagram of three connected processes (Process 1, Process 2 and Process 3) with link widths labelled a, b and c.]
Figure 3.9: Generalisation of the data representation problem. Operators of a data-flow
graph have been replaced with processes. Bandwidth reduction is a key problem here and
may override the area cost; block floating-point and dual fixed-point [44] will be important,
particularly if the input stream is compressed.
Bandwidth constraints are calculated by looking at the size of operators in all of
the different representations. Only those that meet the bandwidth constraint are
considered when solving the problem. This may affect every operator connected to
the output and is a possible use of block floating-point (containing one exponent for
many variables) and dual fixed-point which both have a larger dynamic range than
fixed-point.
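Block floating-point, in which one exponent is shared by many variables, can be illustrated with the following sketch. It is a generic model rather than the format evaluated in this work: the function names, the 16-bit default mantissa width and the saturation of out-of-range mantissas are all assumptions.

```python
import math

def bfp_encode(values, mantissa_bits=16):
    """Encode a block as (shared exponent, signed integer mantissas)."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0] * len(values)
    exponent = math.frexp(largest)[1]              # largest < 2**exponent
    scale = 2.0 ** (mantissa_bits - 1 - exponent)
    limit = (1 << (mantissa_bits - 1)) - 1         # saturate at the mantissa range
    mantissas = [max(-limit, min(limit, round(v * scale))) for v in values]
    return exponent, mantissas

def bfp_decode(exponent, mantissas, mantissa_bits=16):
    scale = 2.0 ** (mantissa_bits - 1 - exponent)
    return [m / scale for m in mantissas]

exponent, mantissas = bfp_encode([3.0, 0.5, -1.25], mantissa_bits=8)
print(exponent, mantissas, bfp_decode(exponent, mantissas, mantissa_bits=8))
```

Only one exponent is transmitted for the whole block, so the bandwidth per variable approaches that of fixed-point while the block as a whole retains a floating-point dynamic range.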
The methodology outlined can be applied to cyclic graphs and is not restricted
to simple functions such as addition and multiplication. In large systems, devices
may be connected together (processes in figure 3.9) making bandwidth reduction
more important. The goal is to reduce widths a, b and c, increasing the throughput.
The advantage of reconfigurable hardware is that changing representation need not
affect throughput.
Multiple Operator Embedded Devices
In the experiments, it is assumed that each embedded device is used for a single
operation. It is common for embedded multipliers to contain logic to accumulate
variables (for example, the Xilinx DSP48 shown in figure 3.6). This extra addition is
commonly used to connect multipliers together. For this reason, it is unlikely that
the addition hardware will be used for a different operation (although this may be
possible, for example, if the multiplication is smaller than 18 bits and can fit on a
single embedded block).
To extend the model, the embedded blocks should be broken down into their
constituent components. This makes the problem more complex because there are
more operators to choose from. Given that there are a limited number of inputs
to each component, a new constraint will have to be added. Consider a multiply-
accumulate block. If the multiplier is utilised, it may not be possible for the adder
to be used with two independent inputs.
3.3.2 Simulated Annealing
First, a fixed-point or floating-point number system is selected to reduce the cost.
Since the cost of one is usually significantly smaller than the cost of the other, for
example, the benchmark shown in figure 3.5, a solution using random data formats
is not generated. To ensure that valid solutions are generated, all constraints are
updated when the architecture of a node is modified.
Algorithm Run Time
To provide a faster alternative to ILP, which often runs too slowly for entire applications,
simulated annealing [79] with a geometric cooling schedule is adopted
to incorporate the constraints. Solutions that the algorithm generates must be
checked for correctness. When a constraint is not met, the algorithm can slow
down significantly because a new solution must be generated and checked before
progress can be made. To reduce the time taken to run the algorithm, specific
architectural characteristics are checked. Invalid states are never generated, for
example, floating-point addition operations can be mapped to DSP48 blocks but not
Mult18×18 blocks.
Suboptimal Solutions
The number of resources that an operation requires is predicted without looking at
other operators in the data-flow graph; routing overheads are disregarded. This is
done because routing overheads depend on the algorithm used to place and route the
circuit and the architecture of the device, for example, the location of the embedded
resources. These overheads, caused by the placement of components, result in small
inaccuracies in the estimates. It may be the case that simulated annealing, although
not producing the optimal result with regards to the high-level cost estimation,
produces hardware designs that are the same size or smaller. As shown in section 3.4,
simulated annealing produces near-optimal solutions.
Simulated annealing provides a way of generating near-optimal solutions with a
greatly reduced run time. In this approach, the algorithm is run until no change can
be seen for a given number of iterations. This ensures that the solution obtained is
as close to optimal as possible without taking an excessive amount of time.
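The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the node names, the per-node costs in COSTS, the converter cost CONVERT and the parameter values (t0, alpha, patience) are all hypothetical, and the move generator only ever proposes one of the two valid formats rather than updating a full constraint set.

```python
import math
import random

# Hypothetical per-node LUT costs for each candidate format (illustrative only).
COSTS = {
    "mul0": {"fixed": 400, "float": 250},
    "mul1": {"fixed": 400, "float": 250},
    "add0": {"fixed": 60, "float": 500},
    "add1": {"fixed": 60, "float": 500},
}
# Hypothetical cost of a format converter on an edge whose endpoints differ.
CONVERT = 120
EDGES = [("mul0", "add0"), ("mul1", "add0"), ("add0", "add1")]


def cost(assign):
    """High-level area estimate: operator costs plus converters on mixed edges."""
    total = sum(COSTS[node][fmt] for node, fmt in assign.items())
    total += sum(CONVERT for a, b in EDGES if assign[a] != assign[b])
    return total


def anneal(start, t0=1000.0, alpha=0.95, patience=200, seed=0):
    """Simulated annealing with a geometric cooling schedule (T <- alpha * T).

    Stops once the best cost has not improved for `patience` iterations.
    """
    rng = random.Random(seed)
    current = dict(start)
    best = dict(start)
    temperature = t0
    stale = 0
    while stale < patience:
        # Move: flip the format of one node; only valid formats are proposed.
        node = rng.choice(sorted(current))
        candidate = dict(current)
        candidate[node] = "float" if current[node] == "fixed" else "fixed"
        delta = cost(candidate) - cost(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if cost(current) < cost(best):
            best = dict(current)
            stale = 0
        else:
            stale += 1
        temperature = max(temperature * alpha, 1e-9)
    return best


if __name__ == "__main__":
    # Start from a single number system, as described in the text.
    all_fixed = {node: "fixed" for node in COSTS}
    result = anneal(all_fixed)
    print(result, cost(result))
```

The stopping rule mirrors the text: the search ends when no improvement has been seen for a given number of iterations. The result is never worse than the starting assignment, but it is not guaranteed to be optimal.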
[Figure 3.10 plots. (a) area: area [LUTs] against the number of dedicated functional blocks (0 to 60) for ilp-fixed, ilp-float, sa-mixed and ilp-mixed. (b) algorithm run time: time [s], on a logarithmic axis from 1e-06 to 10000, against the number of dedicated functional blocks (0 to 60) for ilp-mixed, sa-mixed, ilp-fixed and ilp-float.]
Figure 3.10: Area (LUTs) and algorithm run time for a convolution (9 multipliers, 8
adders) with multiple number systems and a varying number of embedded multipliers on
the Virtex 4 LX200. Mixing numerical representations (ilp-mixed/sa-mixed) is shown to
produce a 15% area reduction. Increasing the number of embedded resources available on
the device does not always warrant the reduction in area (28 DSP48 blocks for 223 LUTs).
The reduction in the time taken to run the algorithm is also shown: 50 seconds to find the
optimal solution, ilp-mixed, compared with 1 second for a near-optimal solution, sa-mixed.
3.4 Results
The following designs and cores were synthesised with Xilinx ISE 10.1. Four
algorithms are compared: fixed-point with ILP (a solution generated with ILP² in
which every operator has a fixed-point type unless constrained), all floating-point
with ILP, mixed numerical representations with ILP and mixed representations with
simulated annealing (section 3.3.2).
3.4.1 Convolution
Figure 3.10(a) shows how the area of a floating-point convolution (9 multipliers, 8
adders) changes as the number of dedicated functional units³ changes on a Virtex 4
LX200. Constraints ensure that all inputs have a floating-point representation and
that floating-point addition cannot be mapped to embedded multipliers but can be
mapped to DSP48s. Rounding logic has not been included in the fixed-point units;
it requires only a small amount of additional hardware and does not significantly
affect the results.
² The set of linear equations is created (as specified in section 3.3.1) and then solved with ILOG
CPLEX version 9.
³ On the Virtex 4 device, the dedicated functional units are embedded DSP48 blocks containing
an adder as well as an 18×18-bit multiplier; on the Virtex II device, they are embedded Mult18×18s
which contain no additional adder.
[Figure 3.11 plot: area [flip-flops] against the number of dedicated functional blocks (0 to 60) for ilp-float, ilp-float-dspadd, sa-mixed and ilp-mixed.]
Figure 3.11: Area (flip-flops) for the floating-point convolution with multiple number
systems and a varying number of embedded multipliers on the Virtex 4 LX200. Given
that the number of LUTs is equal to the number of flip-flops on this device, reducing both
is important. Adders are moved onto embedded blocks (if supported, ilp-float-dspadd) if
there are more than 36 available, resulting in a smaller reduction because of poor resource
utilisation.
There is a 15% improvement in area over the original floating-point design (42%
over a fixed-point design) if no DSP48s are used. An unlimited supply of DSP48s
yields no improvement; however, when 36 DSP48s are utilised, the area of the
floating-point design is reduced by 27% with the same number of dedicated
resources. To further reduce the area, 28 additional DSP48s are required, and the
area then decreases by only 223 LUTs. Embedded multipliers give a 22% improvement
because floating-point addition cannot be mapped to the dedicated resources.
The time taken⁴ to execute each algorithm is shown in figure 3.10(b). Simulated
annealing is compared with ILP, which is computationally more expensive. Figure 3.10
shows that simulated annealing generates near-optimal solutions (ILP producing the
optimal) for a large reduction in run time. Simulated annealing may occasionally
produce a better solution because after the circuit has been placed and routed, the
area estimate loses some of its accuracy; this is partly due to routing overheads.
The run time of the simulated annealing algorithm is almost constant although it
will increase as the number of variables increases. It occasionally runs more slowly
than the ILP solver because the algorithm must ensure that the search space has
been explored sufficiently, whereas the ILP solver can guarantee that the optimal
solution has been found and terminate immediately. The ILP model takes up to 50
seconds. Utilising embedded multipliers as opposed to DSP48 blocks results in a
drop in the time taken to run the algorithm from 50 seconds to 15 seconds because
⁴ All results were obtained using an Intel Core2 Duo 3.00GHz processor with 4GB RAM.
the additional constraint simplifies the problem. The shape of figure 3.10(b) is due
to the constraints enforced on the design. As the number of DSP48s is reduced, the
problem becomes less complex because there are fewer possible solutions.
It is important that the performance is not reduced as a result of optimising
circuit area: a bit-serial design will be smaller but will run significantly more
slowly. The optimal number of pipeline stages is selected for each functional unit as
determined by the tools which try to increase the clock frequency while reducing
the flip-flop utilisation⁵. The fixed-point designs could operate at a higher clock
frequency; however, the number of flip-flops is already significantly higher than
the other designs. Figure 3.11 shows how the number of flip-flops is reduced by
mixing numerical representations despite having optimised the number of LUTs;
this will not always be the case so additional cost functions must be created. The
designs operate at 256MHz and vary by a maximum of 10MHz. The maximum clock
frequency is calculated by starting from a timing constraint that cannot be met and
gradually increasing the clock period by 0.1ns until the timing constraints are met.
The smallest clock period is then taken. The frequency is calculated this way to
avoid the tools prematurely terminating the routing algorithm.
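The search for the maximum clock frequency described above can be sketched as a simple loop. This is an illustrative sketch only: `meets_timing` is a hypothetical callback standing in for a place-and-route run with a given timing constraint, and the start value, search limit and threshold in the usage example are assumptions rather than values taken from the experiments.

```python
def minimum_clock_period(meets_timing, start_ns, step_ns=0.1, limit_ns=100.0):
    """Return the smallest tested clock period (ns) for which timing is met.

    `meets_timing(period_ns)` is a hypothetical stand-in for running the
    place-and-route tools with the given clock period constraint.
    """
    steps = 0
    period = start_ns
    while period <= limit_ns:
        if meets_timing(period):
            return period
        steps += 1
        # Recompute from the start value to avoid accumulating rounding error.
        period = round(start_ns + steps * step_ns, 6)
    raise ValueError("timing not met within the search limit")


def frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    # Stand-in for the tools: pretend timing closes at 3.9 ns or slower.
    period = minimum_clock_period(lambda p: p >= 3.9, start_ns=3.0)
    print(period, frequency_mhz(period))
```

Starting from an unachievable constraint and relaxing it in small steps means each run fails or succeeds on its own merits, which is the behaviour the text relies on to avoid premature termination of routing.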
3.4.2 Financial Modelling
Financial models, such as the generalised autoregressive conditional
heteroskedasticity (GARCH) model [38], require a high degree of accuracy, leading
to a large fixed-point architecture. The benchmark uses a random number generator. Given
that this core is not available for different devices, the node is treated as an input
with a floating-point type (since the core cannot be mapped to dedicated resources).
Coupled with this, the inputs and outputs must have the same type. These con-
straints have been added to the ILP solver and simulated annealing algorithm (see
section 3.3.1), simplifying the problem slightly. Figure 3.12 shows that there is a
14% area reduction if no DSP48s are used and an 11% reduction if an unlimited
number is available (29% if embedded multipliers are used in place of DSP48s).
3.4.3 Image Processing and Ray Tracing
Adopting a floating-point number system often results in more efficient scientific
and financial applications. This improvement is estimated to be almost 50% for
⁵ All floating-point units and conversion cores use the maximum available latency. The number
of pipeline stages for a fixed-point unit is selected based on the clock frequency of the floating-point
design.
[Figure 3.12 plot: area [LUTs] against the number of dedicated functional blocks (0 to 40) for ilp-float, sa-mixed and ilp-mixed.]
Figure 3.12: Area (LUTs) of the generalised autoregressive conditional heteroskedasticity
(GARCH) financial model on the Virtex 4 LX200 with multiple number systems and a
varying number of embedded multipliers. Despite the large area of a fixed-point architec-
ture due to the high accuracy requirement, fixed-point resources are still able to reduce
the area of a floating-point design.
[Figure 3.13 plot: area [LUTs] against the number of dedicated functional blocks (0 to 70) for ilp-float, ilp-float-dspadd, sa-mixed and ilp-mixed.]
Figure 3.13: Area (LUTs) of the ray tracer with multiple number systems and a varying
number of embedded multipliers on the Virtex 4 LX200. This application shows a large
improvement in the case that floating-point adders cannot be mapped to embedded blocks
(as is the case with the Xilinx Virtex II device).
a GARCH financial model in the worst case because floating-point variables are
capable of storing accurate values around zero. For graphics applications which do
not often require a high degree of precision (fractional width), fixed-point operators
are likely to provide the most efficient solution. This is shown by optimising a
Gaussian blur application using the approach outlined in this chapter. There is
no improvement over a circuit using a fixed-point representation because half of
the inputs are constants; they may therefore be stored in fixed-point without the
need for conversion. The area of the floating-point design is reduced by 22% if no
DSP48s are used. In some cases, mixing arithmetic formats will only provide a small
improvement. Despite this, the most efficient format — fixed-point or floating-point
— can still be selected; it is not always clear which format will result in the smallest
circuit, given a specified number of embedded resources, because as the number of
embedded blocks increases, the optimal format can shift. Even in cases where there
is only a small improvement, the improvement is free in that only accuracy that is
not required is removed. If the accuracy is too high, area will be wasted and the clock
frequency and throughput could be reduced.
Ray tracing is an example of a graphics application that often requires a high de-
gree of accuracy. The effect the approach has on a ray tracer (ray-sphere intersection)
is shown in figure 3.13; it contains addition, multiplication and square root functions
(although there may be algorithms that do not use a square root, it is demonstrated
here to show how the approach handles more complex functions). There is a 5%
improvement in area of the ray tracer if no DSP48s are used (39% improvement over
a circuit that only makes use of fixed-point operators). Substituting DSP48s for
embedded multipliers results in a 35% improvement (6% over a fixed-point circuit).
The small improvement is due to the large number of converters required to change
the format of a vector. The overhead of constructing the adders with a floating-point
format is, in some cases, smaller than the overhead of the format conversion.
3.4.4 Additional Case Studies
Uniform cubic B–splines [71] are commonly used in image warping applications. The
area of a circuit realising such an application can be reduced by up to 21% using the
approach described. The accuracy may be higher than typically required; however, if
an ASIC were developed, the circuit would have to cater for every situation because
the device could not be modified once created. A design using floating-point
operators alone is large due to the lack of constant floating-point multipliers;
however, a floating-point multiplier may still be smaller than a constant fixed-point
multiplier and a converter. As well as area being reduced, the number of DSP48s is
reduced by 15%. The time taken to optimise the benchmark follows a similar pattern
to the convolution. This design contains fewer nodes than the convolution, therefore
optimising it does not take as long — a maximum time of 0.83 seconds. In such
cases it is preferable to use ILP to guarantee that the solution is optimal.
The area of a circuit to calculate degree–7 polynomials, sometimes used to
approximate more complex functions [88], is reduced by less than 3% if no DSP48s
are used and 50% if there is an unlimited supply (compared to using floating-point
operators). The limited area reduction is due to the structure of the benchmark.
The multiplications are interleaved with additions — addition immediately follows
multiplication: $((c_{n-1}x + c_{n-2})x + c_{n-3})x + \dots + c_0$. This means that large clusters of
adders or multipliers cannot be found, thus, it is less likely that the area reduction
gained by using the most suitable numerical representation for each operator will be
larger than the area required to convert representations. Using multiple numerical
representations does not improve the area of complex multiplication, but the optimal
representation is selected, either fixed-point or floating-point, depending on the
availability of dedicated resources.
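The interleaving of operators in the polynomial benchmark can be illustrated with a short sketch. The function names `horner_trace` and `largest_cluster` are hypothetical helpers introduced here for illustration; they evaluate the nested form given above and record the order of the operators, showing that no two operators of the same kind are directly connected.

```python
def horner_trace(coeffs, x):
    """Evaluate c[n-1]*x^(n-1) + ... + c[0] using the nested (Horner) form.

    `coeffs` is ordered from the highest-degree coefficient down to c[0].
    Returns the value and the sequence of operators in evaluation order.
    """
    acc = coeffs[0]
    ops = []
    for c in coeffs[1:]:
        acc = acc * x  # multiplication ...
        ops.append("mul")
        acc = acc + c  # ... immediately followed by an addition
        ops.append("add")
    return acc, ops


def largest_cluster(ops):
    """Length of the longest run of consecutive identical operators."""
    best = run = 0
    previous = None
    for op in ops:
        run = run + 1 if op == previous else 1
        best = max(best, run)
        previous = op
    return best


if __name__ == "__main__":
    # Degree-7 polynomial with all coefficients equal to 1, evaluated at x = 2.
    value, ops = horner_trace([1] * 8, 2)
    print(value, ops, largest_cluster(ops))
```

For a degree-7 polynomial the chain contains seven multiplications and seven additions, and the largest cluster has size one, so every change of representation between the two operator types would require a converter.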
The approach can be employed to reduce the area requirement on different FPGA
architectures. Altera devices use a different embedded multiplier architecture; in
the case of the Stratix II, it is based on 9-bit multipliers. A single 36×36-bit
multiplier can be split into four 18×18-bit multipliers or eight 9×9-bit multipliers. The approach gives
slightly smaller improvements for the convolution (6%), GARCH (8%), ray tracer
(no improvement, although constant functional units may provide an improvement)
and fast Fourier transform (4%) benchmarks.
3.4.5 Analysis
The most common approach to mixing numerical representations is to use a floating-
point representation for all of the multipliers and a fixed-point representation for all
of the adders [62]. For most of the applications illustrated, this results in area-efficient
hardware devices. The problem with this method is that it is not guaranteed to
produce the smallest circuits (figure 3.13). Although there are area reductions of up
to 15% using the approach outlined in this chapter, there are limitations.
There will be small area reductions for integer applications because additional
representations, such as floating-point, will have a large overhead; there is no
improvement compared with a fixed-point Gaussian blur, however, the area
is 22% lower than a floating-point design with integer inputs and floating-
point constants. Even if mixing different number systems does not provide
an improvement, the optimal solution using a single number system is still
generated.
Interleaving adders and multipliers (multipliers immediately follow adders or
vice versa) means that clusters of operations in each representation are small
[Figure 3.14 plot: area [LUTs] against the number of dedicated functional blocks (0 to 35) for single-float, single-mixed, reduced-float and reduced-mixed, with annotations marking the improvement and the accuracy reduction.]
Figure 3.14: An example of the approach combined with word-length optimisation for the
convolution. First, the accuracy of the floating-point design is reduced (from 23 bits to
15 bits in this case). For this design, the representation is unchanged; however, this is not
guaranteed. Four designs are shown: 32-bit floating-point (single-float; single-precision),
24-bit floating-point (reduced-float; 8-bit exponent and 15-bit mantissa) and two circuits
using mixed number representation with the same accuracy as the respective floating-point
circuits (single-mixed and reduced-mixed).
(polynomial approximation exhibits this, having an improvement of less than
3%, as explained in section 3.4.4).
It is not possible to reduce the area by a specified amount. The approach may
be combined with word-length optimisation [2] to generate a circuit with a
mixed representation and a reduced word-length [131]; however, this is not done
due to the difficulty of finding a suitable accuracy requirement. The accuracy
of each operator is guaranteed to be identical to its floating-point equivalent.
It is possible to reduce the accuracy of the floating-point operators, reducing
area; however, this may not produce the most efficient solution. It may be
more efficient to transform the floating-point software application into a mixed
representation and then reduce the word-length. This requires an accuracy
metric. A demonstration of how this might work is shown in figure 3.14. Two
designs are compared, one meeting the accuracy of single precision floating-point
and the other, 24-bit floating-point with a 15-bit mantissa.
Provided that the characteristics of the application are known (such as the size of
each variable), the optimisation described is free because the accuracy is preserved.
Coupled with this, the development time is small (less than a second for the designs
shown).
3.5 Proposed Device Architecture
Mixing numeric representations gives an improvement in area of up to 22% (for the
convolution using embedded multipliers). The most efficient architecture for a given
application would show no improvement if numerical representations were mixed even
if format conversion did not require any area. Based on the results presented, a new
device architecture is proposed. Reconfigurable devices are flexible because almost any
hardware circuit can be mapped to the LUTs, flip-flops and routing matrix. Current
FPGAs work well for fixed-point applications, such as image processing, because
they contain dedicated fixed-point multipliers to improve performance. Graphics
processors contain a large array of single and double precision floating-point units,
making them ideal for high-accuracy scientific computation, although they have
limited architectural flexibility. If a device were created with an array of floating-
point units, both adders and multipliers, with reconfigurable interconnect between
them, the flexibility of a reconfigurable device could be combined with the low-power
architecture of a graphics processor. Hardware design would also be simplified.
Two possibilities exist to cater for the high-demand for floating-point units:
1. Include dedicated floating-point units on the FPGA. This is not a new idea.
Ho et al. [64] propose that floating-point units be embedded into reconfigurable
logic. The floating-point adder sits after the output of the multiplier, making
routing more efficient for multiply-add operations. Beauchamp et al. [4] propose
a more general idea in which dedicated shifters are included in place of floating-
point adders to enable fixed-point applications to use the shifter. This method
is more flexible but relies on lookup tables to construct certain functional units.
2. Increase the size of the dedicated multipliers already on the FPGA and include
extra components to construct floating-point units. This would reduce the
number of fixed-point multipliers required to create a floating-point multiplier
and allow fixed-point applications to be efficiently mapped to an FPGA. Virtex 5
devices have a similar 25×18-bit embedded multiplier; however, there is no
floating-point adder and the multiplier is not large enough to perform a full
floating-point multiplication.
The second architecture provides a more generic platform: both fixed-point and
floating-point architectures can be efficiently mapped to the device. It also means that
[Figure 3.15 diagram. (a) architecture: columns of embedded blocks (EB) among smaller reconfigurable elements. (b) embedded block, with labelled components: Compare + Shift etc.; 48×48-bit Fixed-point Adder (+ Round); Normalise; Exponent Addition; 25×25-bit Fixed-point Multiplier (+ Round); Normalise.]
Figure 3.15: Proposed architecture to reduce circuit area. This new architecture is de-
signed to be used effectively by a wide range of applications by splitting the floating-
point units into their fixed-point components. The coarse grain dedicated fixed-point and
floating-point adder and multiplier are arranged such that the multiply-add operation
can be constructed efficiently. These embedded blocks (b) are arranged in columns [64],
represented by the larger blocks (EB) in (a); the smaller blocks represent reconfigurable
elements. Additional multiplexers and configuration logic required for routing have been
omitted; see [64] for a more detailed overview.
applications, such as Gaussian blur, can make use of both fixed-point and floating-
point dedicated operators. This alleviates a potential limitation of [64] whereby
fixed-point applications cannot be mapped directly to the dedicated resources — the
architecture would have to be modified. Given that some components of floating-point
operators, such as shifters, may not be used often outside of certain application
domains, they are combined with additional logic to make such dedicated blocks
more efficient.
The method used to connect embedded blocks together must also be considered.
Adding values immediately after multiplying them is common (complex multiplication,
for example). For this reason, it may be desirable to place the adder after the output
of the multiplier [64]. If embedded floating-point multipliers were replaced by fixed-
point multipliers they could no longer be connected directly to the floating-point
adder unless hardware were inserted between them to add the exponents and round
the values. Since variable rounding is required by the IEEE 754 floating-point
standard and can be beneficial in fixed-point applications (introducing an error of
$2^{-(\mathrm{width}(x)+1)}$ to variable $x$ as opposed to $2^{-\mathrm{width}(x)}$), it is included. The proposed
architecture is shown in figure 3.15. The floating-point multiplier is split into a
fixed-point multiplier, rounding and exponent addition logic. This way, if fixed-
point arithmetic is needed, the 25-bit multipliers can be connected together. One
[Figure 3.16 plots. (a) cost model: area [LUTs] against the number of dedicated functional blocks (0 to 40) for luts and estimated. (b) proposed device cost: area estimate [LUTs] against the number of dedicated functional blocks (0 to 40) for ilp-fixed, ilp-float, sa-mixed and ilp-mixed.]
Figure 3.16: (a) The circuit size of a convolution (section 3.4.1) mapped to a Virtex 4,
determined by the cost model — which estimates the circuit area as if it were realised
on the specified architecture — is compared with the circuit size determined by mapping
the hardware description to the components on the device. The absolute area of each
circuit is less important than the relative area since it is the difference in area of different
operators that is used to select which architecture would result in the lowest cost. (b) The
cost model is adopted to demonstrate the approach on the new device. The area falls to
zero because all of the operators can be mapped to the dedicated resources on the virtual
device. In practice some lookup tables may be added to reduce routing delay, increasing
the clock frequency.
single precision floating-point multiplication can be performed using one dedicated
block. Including a fixed-point adder as part of a multiplier, as in a DSP48, reduces
the number of LUTs required to connect the multipliers together (section 3.2.3,
figure 3.5), potentially increasing the clock frequency. Given that a fixed-point adder
constructed out of LUTs and carry-chain logic is relatively small, it may be omitted.
If double precision floating-point is required, the multipliers can be increased in size.
The approach outlined in this chapter is now applied to the new device architecture.
First, a model must be created to predict the cost of a circuit mapped to the new
architecture. This reduces the time taken to perform the analysis by removing the
high cost of hardware mapping. It is important that the cost model be representative;
the relative cost of one circuit compared with another as determined by the cost
model must be as close as possible to the relative cost if the circuits were realised on
the specified device (figure 3.16(a)). The difference between the cost model and the
area of the circuit is due to routing. As stated in section 3.3.2, the routing overhead
is not included in the cost model because it is dependent on the data-flow graph and
algorithms used to place and route the circuit. The routing architecture would have
to be accurately described, and the cost model would take significantly longer to
run. Figure 3.16(b) shows that even with floating-point units on the device, mixing
numerical representations under resource constraints still yields an area improvement.
This is a result of devices having to cater for a variety of applications with different
accuracy requirements.
3.6 Summary
Selecting the optimal representation for all of the operators in a hardware circuit is
a complex task. Changing the representation of one operator may result in every
operator connected to it needing its representation changed. Dedicated resources on
the device may also cause the representation to change because some operators can
be fully mapped to dedicated resources while others cannot.
In this chapter it has been shown that mixing different numerical representations
can reduce the area of a hardware circuit despite the overhead of format conversion.
The key aspect is that the accuracy of each floating-point operator is guaranteed
regardless of the representation that is ultimately used. The data representation
problem has been formulated as a set of linear equations which are solved to give the
most efficient representation for each operator. This is efficient for small applications
but often runs too slowly, in which case, simulated annealing is adopted to provide
near-optimal results more rapidly.
Given a floating-point application, the area is reduced by up to 15%. The technique
is then extended to make use of dedicated resources on a field programmable gate
array (FPGA) — embedded multipliers and DSP blocks on Xilinx and Altera devices
— increasing the area reduction to 22%. In conjunction with this, the number of
dedicated resources is reduced by up to 15%.
The approach is currently being extended by integrating more data formats,
for example, the logarithmic number system [49, 121], to reduce area further. As
explained in the introduction, accuracy must be maintained. Conversion to a logar-
ithmic number system may introduce errors that must be reduced by increasing the
word-length of other operators. The effect that mixing number systems has on power
consumption is also being investigated. The reduction in area and possible reduction
in power consumption may be affected by modifying the pipeline. Haydn–C [37] is
being used to check this.
In some cases, transforming an application while maintaining its accuracy is not
a strong enough constraint. Sometimes, the error on the output must be guaranteed
to be lower than a specified value. This is the problem tackled in the next chapter.
CHAPTER 4
Scalable Accuracy-Guaranteed
Word-Length Optimisation
When designing a circuit, the word-length of each variable, array and constant must
be chosen carefully to reduce the area and increase the maximum attainable clock
frequency. Given that most applications use a minimum of 32 bits for each variable
with an unspecified range and precision, the problem of selecting the optimal size of
each operator becomes intractable. The problem is NP–Hard [28], so heuristics have
to be used to guide the search to a near-optimal solution, without getting trapped in
local minima. Word-length optimisation may also be used to increase performance
and reduce the energy used by general purpose processors and dedicated graphics
hardware, although the size of operators is limited, reducing the complexity of the
problem.
There are two methods of word-length optimisation. The first is based on
simulation or statistical analysis [2, 83, 108]. Optimisation is targeted towards a
specific training set. The difficulty is choosing the training set; the simulation is not
guaranteed to produce results within the error requirement for every input. The
second method of word-length optimisation is to guarantee that the error meets a
specification — error models are required [60, 87]. The disadvantage is that the
solution can be conservative, producing designs that are larger than required.
Design exploration keeps the accuracy within a few percent of the optimal solution.
An accuracy-guaranteed methodology [87] is adopted and combined with information
gathered at run time to reduce overestimates (section 4.2). Reducing the run time
of the word-length analysis is essential when transforming software applications
into hardware circuits because it is one of the most computationally demanding
components [16]. Methods of achieving a scalable algorithm are investigated; one such
approach is to partition the data-flow graph. When analysing a 4×4 matrix-vector
multiplier, simulated annealing (section 4.1.3) runs in 16 seconds. Doubling the
number of variables increases the processing time to over 40 seconds. The heuristic
runs in under a second. The reduction in area caused by using the simulated annealing
algorithm is less than 1% in both cases.
The contributions can be summarised as follows:
1. Aggressive heuristics to estimate non-uniform word-lengths rapidly while satis-
fying error constraints (section 4.1.1).
2. A method of reducing the complexity of the problem (section 4.1.2).
3. Profiling to enable the precision of library functions using an unknown algorithm
to be estimated. Control-flow analysis is proposed to reduce power consumption
(section 4.2).
The benefits are illustrated with case studies, including Gaussian blur, B–splines,
ray tracing, convolution, matrix-vector multiplication and RGB to YCbCr colour
conversion (section 4.3).
4.1 Range and Precision Reduction
As mentioned in chapter 2, the range of each variable can be calculated by simulating
the design or propagating each range through the circuit using techniques such as
interval arithmetic [100] or affine arithmetic [117]. Interval arithmetic is a fast range
propagation algorithm but does not use correlation information (shared variables)
to further reduce ranges, as affine arithmetic does. In this chapter, the focus is on
precision analysis and optimisation; range analysis can be performed using affine
arithmetic (section 2.2.2) where possible (run time information may be required if a
function is unknown), interval arithmetic (which is less computationally intensive)
or simulation.
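The difference between the two range propagation techniques can be seen in a small sketch. The classes `Interval` and `Affine` below are hypothetical, minimal illustrations that support subtraction only; they are not the analysis used in this thesis. The example evaluates x − x, where the two operands are fully correlated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __sub__(self, other):
        # Worst case: the operands are treated as independent.
        return Interval(self.lo - other.hi, self.hi - other.lo)


@dataclass(frozen=True)
class Affine:
    """centre + sum(coef_i * eps_i), with each noise symbol eps_i in [-1, 1]."""
    centre: float
    terms: tuple = ()  # ((symbol, coefficient), ...)

    @staticmethod
    def from_range(lo, hi, symbol):
        return Affine((lo + hi) / 2, ((symbol, (hi - lo) / 2),))

    def __sub__(self, other):
        merged = dict(self.terms)
        for symbol, coef in other.terms:
            # Shared symbols combine, so correlated terms can cancel.
            merged[symbol] = merged.get(symbol, 0.0) - coef
        return Affine(self.centre - other.centre, tuple(sorted(merged.items())))

    def to_interval(self):
        radius = sum(abs(coef) for _, coef in self.terms)
        return Interval(self.centre - radius, self.centre + radius)


if __name__ == "__main__":
    x_interval = Interval(0, 10)
    x_affine = Affine.from_range(0, 10, "x")
    print(x_interval - x_interval)              # range widened to [-10, 10]
    print((x_affine - x_affine).to_interval())  # correlation kept: [0, 0]
```

Interval arithmetic reports the range [−10, 10] because it cannot tell that both operands are the same variable, whereas the affine form cancels the shared noise symbol and reports zero. The extra bookkeeping per operation is the reason affine arithmetic is more computationally intensive.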
Precision optimisation involves reducing the fractional part of a variable. Each
edge in the data-flow graph corresponding to a fractional data-type is given a word-
length — a range and a precision in this case (if floating-point variables were being
analysed, a sign, mantissa and exponent would be supplied). This ensures that if a
variable is used in several places, each instance can have a different precision. There
are two types of error that can be minimised: relative and absolute. Absolute error
may be less useful if there is a large dynamic range; however, in some applications,
such as safety-critical systems, the maximum error must be known. To calculate
relative error, test data is often used because it is not always clear how the error will
change as the range changes.
The approach adopted in this thesis is based on a compile-time technique to
reduce absolute error because test data may not always be available. It works by
calculating the worst-case error for each variable [87]. The error caused by each
input having a fixed width is $2^{-\mathrm{FB}(x)}$ in the worst case, where $\mathrm{FB}(x)$ equals the
number of fractional bits for the fixed-point variable $x$; if the input has additional
error associated with it, it is added on. The inclusion of round-to-nearest logic as
opposed to truncation would cause the error to decrease to $2^{-\mathrm{FB}(x)-1}$. This can
be beneficial because the logic required for rounding is small. To ensure that the
accuracy is guaranteed, instead of performing arithmetic on variables, it is performed
on the errors [60], for example:
y = a× b
yerror = (aerror × |b|) + (berror × |a|) +
(aerror × berror) + 2−FB(y) (4.1)
72  Chapter 4: Scalable Accuracy-Guaranteed Word-Length Optimisation
[Figure 4.1 graph: inputs a and b feed a multiplier (×) producing y; y and b feed an adder (+) producing z.]
Figure 4.1: Example data-flow graph to illustrate the precision analysis problem using
affine arithmetic. Correlations (shared variables) can result in reduced operator widths in
some cases.
where $a_{error}$ represents any error associated with variable $a$, for example, the error
caused by restricting its width. A similar equation is used for addition/subtraction:

$$y = a + b$$
$$y_{error} = a_{error} + b_{error} + 2^{-FB(y)}$$

In most cases the error caused by truncation is lower than $2^{-FB(y)}$ because the
maximum precision width of the output is limited. The error is therefore reduced;
for simplicity this is not shown here.
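The two propagation rules can be written as a short sketch. The function and parameter names are assumptions chosen for this illustration; the formulas follow equation 4.1 and the addition rule, with `a_abs_max` and `b_abs_max` standing for the largest magnitudes the operands can take.

```python
def input_error(frac_bits, round_to_nearest=False):
    """Worst-case error of an input with the given number of fractional bits."""
    return 2.0 ** -(frac_bits + (1 if round_to_nearest else 0))


def mul_error(a_err, b_err, a_abs_max, b_abs_max, fb_y):
    """Worst-case error on y = a * b, truncated to fb_y fractional bits."""
    return a_err * b_abs_max + b_err * a_abs_max + a_err * b_err + 2.0 ** -fb_y


def add_error(a_err, b_err, fb_y):
    """Worst-case error on y = a + b (or a - b), truncated to fb_y fractional bits."""
    return a_err + b_err + 2.0 ** -fb_y


# Hypothetical operands: |a| <= 2, |b| <= 4, 8 fractional bits everywhere.
a_err = input_error(8)
b_err = input_error(8)
print(mul_error(a_err, b_err, a_abs_max=2.0, b_abs_max=4.0, fb_y=8))
print(add_error(a_err, b_err, fb_y=8))
```

Because the operand magnitudes enter the multiplication rule directly, a tighter range for `a` or `b` lowers the error bound on `y`.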
Equation 4.1 shows that a change in the range of a variable will affect the precision.
As variables are connected together to form a large data-flow graph, correlations will
inevitably begin to emerge. To tackle this, correlation coefficients are incorporated
in an affine form [87]. An addition may therefore be written as follows.
$$y_{error} = a_{error} + b_{error} + 2^{-FB(y)}\varepsilon_3 \qquad (4.2)$$

where $a_{error}$ is defined as $2^{-FB(a)}\varepsilon_1$ if it is an input and has no additional error
associated with it. This equation contains several correlation coefficients, $\varepsilon_i$; a software
application will contain many more. If correlations exist, these coefficients appear
more than once.
Consider the data-flow graph in figure 4.1. Given that a and b are both inputs,
the errors are calculated based on their fractional width (assuming no additional
error; error may be introduced if these variables are outputs from other blocks):

$$a_{error} = 2^{-FB(a)}\varepsilon_1$$
$$b_{error} = 2^{-FB(b)}\varepsilon_2$$

The error along edge y is calculated as above (equation 4.1), with the addition of a
correlation coefficient (equation 4.2). The error along the remaining edge is given as
follows:

$$z_{error} = y_{error} + b_{error} + 2^{-FB(z)}\varepsilon_4$$

The following error function can therefore be derived:

$$z_{error} = \begin{pmatrix} 2^{-FB(a)}\varepsilon_1 \\ 2^{-FB(b)}\varepsilon_2 \\ 2^{-FB(y)}\varepsilon_3 \\ 2^{-FB(z)}\varepsilon_4 \\ 2^{-FB(a)} \times 2^{-FB(b)}\varepsilon_5 \end{pmatrix} \cdot \begin{pmatrix} b_{abs\,max} \\ a_{abs\,max} + 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$
where $a_{abs\,max}$ and $b_{abs\,max}$ are the maximum absolute values a and b can take
respectively. The final term, $2^{-FB(a)} \times 2^{-FB(b)}$, is sometimes omitted because it is
close to zero [87]. In some cases, it may be desirable to set a high error constraint
(close to 1 or perhaps higher) when analysing loops. Loops can cause the error to
increase exponentially causing the final term to be much larger. Since this method of
word-length analysis is used as an upper bound for the circuit area, it is considered
in this case.
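As a rough illustration, the error function for figure 4.1 can be evaluated as a dot product. The sketch assumes that every correlation coefficient has magnitude at most 1, as is usual for affine forms, so the worst case is obtained by setting each one to 1; the function name and the `keep_cross_term` flag are hypothetical.

```python
def z_error_bound(fb_a, fb_b, fb_y, fb_z, a_abs_max, b_abs_max, keep_cross_term=True):
    """Worst-case error on z = (a * b) + b for the graph in figure 4.1."""
    cross = 2.0 ** -fb_a * 2.0 ** -fb_b if keep_cross_term else 0.0
    # One entry per correlation coefficient, in the order used in the text.
    coefficients = [2.0 ** -fb_a, 2.0 ** -fb_b, 2.0 ** -fb_y, 2.0 ** -fb_z, cross]
    weights = [b_abs_max, a_abs_max + 1, 1, 1, 1]
    return sum(c * w for c, w in zip(coefficients, weights))


# Hypothetical widths and ranges: 8 fractional bits, |a| <= 2, |b| <= 4.
with_cross = z_error_bound(8, 8, 8, 8, a_abs_max=2.0, b_abs_max=4.0)
without_cross = z_error_bound(8, 8, 8, 8, 2.0, 4.0, keep_cross_term=False)
print(with_cross, without_cross, with_cross - without_cross)
```

With these widths the cross term contributes only $2^{-16}$, which is why it is often dropped; keeping it preserves the guarantee when the error constraint is loose.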
It is not clear from this error requirement what the precision width should be
to minimise the cost function, and the search space grows non-linearly with the
addition of each variable. In order to produce near-optimal results without covering
the entire search space, a low-effort pass is performed first (section 4.1.1). This
algorithm employs heuristics to reduce the time taken to perform the analysis while
producing near-optimal results. A high-effort pass is designed to be used on the last
iteration of design exploration to cover more of the search space, potentially giving
better results (section 4.1.3).
The first phase of range and precision analysis is to construct a control-data-flow
graph: a control-flow graph in which each node is a data-flow graph. The program
is split into basic blocks; each basic block has no conditional statements or loops.
Input variables must have an input range assigned to them if this cannot be calculated
and output variables must have an accuracy assigned to them. The accuracy on
the output ensures that the circuit that is generated is correct. Accuracy may be
estimated based on metrics, for example, image quality [50].
Arrays are treated as scalar variables [116] because in general, every element of
an array has an equal error associated with it. If an array is used in multiple places
it may have a different width, resulting in smaller operators; this will not be greater
than the width of the array. Arrays that have been statically assigned, for example,
the array of constants in a Gaussian blur, are not treated as scalar variables since
they are usually small and allow opportunities for additional optimisation. The
difference in error on the multipliers in this case is equal to $2^{-x} + \mathrm{truncate}(c, x) - c$,
where $\mathrm{truncate}(c, x)$ truncates the value $c$ to $x$ bits in length.
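A minimal sketch of this expression follows. It assumes that x counts fractional bits, that c is non-negative and that truncation rounds towards zero; the name `error_slack` is hypothetical.

```python
import math


def truncate(c, x):
    """Truncate a non-negative constant c to x fractional bits (assumed meaning of x)."""
    scale = 2 ** x
    return math.floor(c * scale) / scale


def error_slack(c, x):
    """Gap between the worst-case bound 2^-x and the known truncation error of c."""
    return 2.0 ** -x + truncate(c, x) - c


# Hypothetical filter coefficient 0.375 (binary 0.011) stored with 2 fractional bits.
print(truncate(0.375, 2))
print(error_slack(0.375, 2))
```

Since a statically assigned constant is known at compile time, its exact truncation error can replace the generic worst-case bound in the analysis.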
Loops cause the algorithm to run significantly more slowly. Coupled with this,
accuracy-guaranteed approaches employ compile-time analyses and thus cannot
always be used to determine the number of iterations. Existing approaches [2, 87]
examine loops by unrolling them; however, this technique is not practical for programs
with large iteration spaces. To tackle loops in which the number of iterations is
unknown, they are profiled if the range and precision of variables are affected
(determined by looking at loop-carried dependencies); in all other cases loops are
ignored, reducing the time required to run the algorithm. Calculating the number of
iterations of a loop will cause the algorithm to be less conservative because additional
constraints are imposed on the width of the variables; there is therefore more scope
to reduce circuit area. Decreasing the range means that the precision may be able to
be reduced (see section 2.2.3), hence, if the number of iterations of a loop cannot be
determined at compile time, a run-time analysis is undertaken.
4.1.1 Low-Effort Pass
The low-effort pass invokes an error function, generated by using a compile-time
analysis developed with ROSE [107], to check whether the selected variables are
capable of storing values accurately enough to satisfy the error constraints. The first
stage is to select the minimum uniform precision width, as described in section 4.1.2.
Constantinides et al. [34] have shown that using the same width for every variable
produces unnecessarily large designs. To improve results, each operator is allowed
to take a different width. First, the uniform precision width is increased
by a constant amount, $c$:

$$initial = (u_0 + c,\; u_1 + c,\; \ldots,\; u_{n-1} + c)$$
This can improve results because more of the search space is available, simplifying
the analysis in [108] whereby each word-length is increased at a later stage. Each
word-length is then gradually reduced until the accuracy constraint is broken. The
advantage of not increasing the word-length, unlike simulated annealing (section 4.1.3),
is that if a word-length cannot be reduced further, it never needs to be reduced in a
future iteration, decreasing the time to run the algorithm.

$$\forall i:\quad reduced_i = \begin{cases} (p_0, p_1, \ldots, p_i - 1, \ldots, p_{n-1}) & \text{if } p_i > 0 \\ (p_0, p_1, \ldots, p_i, \ldots, p_{n-1}) & \text{otherwise} \end{cases}$$

The set, $reduced_i$, with the lowest error is chosen and the process repeated. The
algorithm terminates when the cost cannot be reduced without breaking the error
requirement. Selecting the correct variable width to reduce is important because
modifying the width of one variable can have a cascading effect on the others. For
this reason, the cost and error of reducing the width of a variable are calculated.
To obtain near-optimal results quickly, each word-length is reduced while keeping
every other word-length the same; the reduction that causes the smallest increase
in error will be selected first. If there are several solutions with the same error, the
one that reduces the cost by the greatest amount is chosen. It may be desirable
to perform a similar analysis in which the variable causing the greatest decrease in
cost is selected first since this takes less time, but it may not produce results of the
same quality. Software designs containing loops with a large number of iterations
cause the error analysis to run significantly more slowly than the cost analysis. For
this reason, finding the most area-efficient circuit may be less important than the
time taken to run the algorithm. The heuristic algorithm outlined is compared with
another algorithm taking less time which may be used in such cases. A word-length
is selected randomly and reduced. This is repeated multiple times, and the reduction
that results in the smallest error increase is chosen:

$$reduced_i = \begin{cases} (p_0, p_1, \ldots, p_i - 1, \ldots, p_{n-1}) & \text{if } p_i > 0 \\ (p_0, p_1, \ldots, p_i, \ldots, p_{n-1}) & \text{otherwise} \end{cases}$$

where $p_i$ is selected randomly. This is a fast algorithm because each word-length is
chosen randomly and produces good results because both error and cost are used to
determine which word-length to reduce. One advantage of such heuristics is that
they can be expanded to calculate the width of operators based on an area or power
reduction constraint. Given a percentage reduction in cost, the minimum error to
achieve this can be calculated by running the algorithm until the area constraint is
met, ignoring the error constraint.
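The random variant might be sketched as follows. The error model (`toy_error`), the number of samples per step and the fixed seed are assumptions made for this illustration, and the tie-break on cost described above is omitted for brevity.

```python
import random


def toy_error(widths, weights):
    """Hypothetical error model: each edge contributes weight * 2^-width."""
    return sum(w * 2.0 ** -p for w, p in zip(weights, widths))


def random_reduction(widths, weights, max_error, samples=3, seed=0):
    """Repeatedly try a few randomly chosen word-lengths and keep the best reduction."""
    rng = random.Random(seed)
    widths = list(widths)
    while True:
        candidates = [i for i, p in enumerate(widths) if p > 0]
        best = None  # (error, index) of the best sampled reduction
        for i in rng.sample(candidates, min(samples, len(candidates))):
            trial = widths[:i] + [widths[i] - 1] + widths[i + 1:]
            e = toy_error(trial, weights)
            if e <= max_error and (best is None or e < best[0]):
                best = (e, i)
        if best is None:
            # No sampled reduction satisfies the constraint: stop early.
            return widths
        widths[best[1]] -= 1


start = [12, 12, 12, 12]
weights = [8.0, 1.0, 1.0, 1.0]
reduced = random_reduction(start, weights, max_error=2.0 ** -6)
print(reduced, toy_error(reduced, weights))
```

Only a handful of error evaluations are made per step, so the run time no longer grows with the number of variables, but the search may stop before every word-length has been tried.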
The approach of simply reducing the width of variables will not always result in
the optimal solution. It may be the case that increasing a word-length will ultimately
produce better results because another word-length with a higher cost-to-error ratio
can be decreased. For this reason, a more intensive (high-effort) pass is used to cover
more of the search space (described in section 4.1.3).
[Figure 4.2 plots: (a) area [LUTs] against iteration for partition-full, partition-50, partition-33 and partition-15; (b) algorithm run time [s] against iteration for partition-full, partition-50, partition-33 and partition-11.]
Figure 4.2: Area and algorithm run time at varying partition sizes for the convolution
benchmark (section 4.3.2). Partitioning the data-flow graph reduces the complexity of
the word-length analysis, thus decreasing the time taken to run the algorithm. A small
increase in area leads to a large reduction in run time.
4.1.2 Complexity
Reducing the size of the search space is important because it can result in a reduction
in run time. The first method of reducing the search space involves calculating a
uniform precision width for every variable. A binary search reduces every word-length
simultaneously to the lowest value satisfying the error requirement. This is one of
the simplest forms of word-length optimisation.
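The uniform search can be sketched as a standard binary search. This assumes that the error never increases when every width grows; the error model, the upper limit of 64 bits and the function names are hypothetical.

```python
def uniform_width(error_fn, n_vars, max_error, upper=64):
    """Smallest width w such that giving every variable w bits meets max_error."""
    if error_fn([upper] * n_vars) > max_error:
        raise ValueError("error constraint cannot be met within the width limit")
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if error_fn([mid] * n_vars) <= max_error:
            hi = mid        # mid is sufficient, try narrower widths
        else:
            lo = mid + 1    # mid is too narrow
    return lo


def toy_error(widths):
    """Hypothetical error model: each variable contributes 2^-width."""
    return sum(2.0 ** -w for w in widths)


print(uniform_width(toy_error, n_vars=4, max_error=2.0 ** -8))
```

The number of error evaluations grows with the logarithm of the width limit rather than with the number of variables, which is why this step is cheap compared with the later passes.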
Heuristics must select a word-length to reduce based on cost (which may refer to
area, power, etc.) or error (absolute
or relative). Selecting where this reduction should occur can be time-consuming
because every variable in the data-flow graph may need its effect on error and cost
calculated. The data-flow graph is therefore partitioned to reduce the time taken
to run the algorithm. For small programs the low-effort pass runs quickly, giving
near-optimal results in most cases; however, for larger programs it may take minutes
or even hours. Partitioning the data-flow graph results in an algorithm with lower
complexity. In order to select a word-length to reduce, the cost function and/or error
function must be run n times where n is the number of variables. As n increases,
the algorithm slows down. By partitioning the data-flow graph, this number is
reduced to p which is constant regardless of the size of the graph. A word-length is
selected from each partition in turn to avoid one group of variables being targeted
more frequently than other variables. Figure 4.2 shows the effect of partitioning the
data-flow graph (each partition having the same size) for a floating-point convolution
(discussed in section 4.3.2). As the algorithm progresses, the area drops because the
width of each adder and multiplier is decreasing, using fewer logic blocks. The graph
shows that using larger partitions produces slightly smaller hardware designs but the
algorithm takes significantly longer to run. Selecting a larger partition size can cause
the area to be reduced more slowly because adders may be reduced in preference to
multipliers.
The algorithm may need to be run many times before the final solution is generated
if it is part of a larger design flow. It is therefore important to reduce the run time.
For more extensive coverage of the search space, a high-effort pass (section 4.1.3) is
executed on the last iteration of design exploration. A summary of the algorithm is
shown in figure 4.3. The steps involved are as follows:
1. A binary search finds the lowest uniform precision width meeting the error
requirement (line 2).
2. This width is then increased (line 5) by a constant amount, C, to more fully
explore the search space (a less complex way of achieving a similar result
as [108] in which each word-length is increased at a later stage).
3. The data-flow graph is split into partitions (line 8) of equal size. This reduces
the complexity of the algorithm because there is a smaller group from which to
select a candidate width to reduce. A trade-off exists: the larger the partition
size, the slower the algorithm but the lower the cost (for example, circuit area
or energy if applied to a hardware circuit, or energy if applied to a software
application); the bandwidth may also be maximised.
4. A word-length is selected from each partition in turn (line 23) to be reduced
based on the error generated by reducing the width. A variable is marked as
invalid (line 28) if it cannot be reduced without breaking an error constraint.
This limits the number of times that a potentially costly error function must
be executed. The error must be propagated through a program because each
statement, conditional and iteration of a loop can affect the error.
 1  // Uniform precision analysis.
 2  uniform = binary search over precision widths
 4  // Assign to precision array.
 5  precision_widths = uniform + C
 7  // Equal sized partitions.
 8  partitions = divide precision_widths into partitions
10  // Validity of each precision width.
11  valid = true
13  // Loop through partitions.
14  do
15      globalChange = false
17      for p in partitions do
18          localChange = false
19          min_error = -1
21          for i = 1 to p.size do
22              if valid[i] then
23                  p[i] = p[i] - 1
24                  e = error analysis
25                  p[i] = p[i] + 1
27                  if error constraint is broken then
28                      valid[i] = false
29                  else if e < min_error or min_error < 0
30                      min_error = e
31                      chosen = i
32                      localChange = true
33                  end if
34              end if
35          end do
37          // Reduce a word-length.
38          if localChange then
39              p[chosen] = p[chosen] - 1
40              globalChange = true
41          end if
42      end do
43  while globalChange
Figure 4.3: A summary of the word-length reduction algorithm. The error analysis (line
24) takes as input the precision widths and propagates error through the data-flow graph.
[Figure 4.4 plots: (a) area [LUTs] against precision [bits] for uniform, heuristic and sa; (b) algorithm run time [s], on a logarithmic scale, against precision [bits] for sa and heuristic.]
Figure 4.4: Area and algorithm run time for the B–splines benchmark at varying levels
of precision. A design having precision x is guaranteed to produce results of the specified
accuracy, $2^{-x}$. Simulated annealing, sa, produces solutions that are closer to the optimal
solution (shown to be within 1% for small designs [87]) than the heuristic algorithm,
heuristic, however, it takes longer to run the algorithm. The difference in area is small
and does not warrant the extra time taken. These algorithms are compared to a design
generated using a uniform precision width, the simplest form of precision optimisation.
5. The algorithm terminates (line 43) when no further changes can be made
without breaking the error requirements.
This algorithm is designed to generate results rapidly. Parts of the algorithm may be
run in parallel because each variable has its error (or cost) analysed independently,
increasing performance.
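A runnable sketch of the partitioned reduction loop is given below. It follows the structure of figure 4.3 but is not the thesis implementation: the error model, the default constant `c`, the partition size and all identifiers are assumptions, and the uniform width is passed in rather than searched for.

```python
def toy_error(widths, weights=(8.0, 1.0, 1.0, 1.0)):
    """Hypothetical error model: each edge contributes weight * 2^-width."""
    return sum(w * 2.0 ** -p for w, p in zip(weights, widths))


def reduce_word_lengths(error_fn, n_vars, uniform, max_error, c=2, partition_size=2):
    widths = [uniform + c] * n_vars          # uniform width plus a constant
    valid = [True] * n_vars                  # False once a width cannot shrink
    partitions = [
        range(start, min(start + partition_size, n_vars))
        for start in range(0, n_vars, partition_size)
    ]
    global_change = True
    while global_change:
        global_change = False
        for part in partitions:
            chosen, min_error = None, None
            for i in part:
                if not valid[i]:
                    continue
                if widths[i] == 0:
                    valid[i] = False
                    continue
                widths[i] -= 1               # trial reduction
                e = error_fn(widths)
                widths[i] += 1               # undo the trial
                if e > max_error:
                    valid[i] = False         # never tested again
                elif min_error is None or e < min_error:
                    chosen, min_error = i, e
            if chosen is not None:
                widths[chosen] -= 1          # commit the best reduction
                global_change = True
    return widths


result = reduce_word_lengths(toy_error, n_vars=4, uniform=10, max_error=2.0 ** -6)
print(result, toy_error(result))
```

Marking a variable as invalid is safe here only because the assumed error model never decreases when a width shrinks; under that assumption no single further reduction is possible once the loop terminates.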
4.1.3 High-Effort Pass
Simulated annealing [79] employing a geometric cooling schedule is executed after
the low-effort pass to cover more of the search space. Simulated annealing has been
shown to produce results within 1% of the optimal solution [87] for small circuits.
The key difference between the low-effort heuristic and the high-effort pass is that
the high-effort pass can increase each word-length as well as decreasing it. As shown
in section 4.3, the area saving gained by covering the search space more fully does
not warrant the extra time taken. Figure 4.4 shows the area of a B–splines circuit
commonly applied in image warping applications [71]. The design has 4 outputs.
The error on each output is given on the x-axis; thus a precision of 1 bit results in a
maximum error of 0.5 on the output. Every output has the same error constraint.
The graph shows that simulated annealing, sa, only produces marginally better
results (approximately 1% lower area) taking 58 times longer to run on average.
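For comparison, a minimal simulated annealing sketch with a geometric cooling schedule is shown below. The move set (one word-length changed by one bit), the acceptance rule, the parameter values and the toy cost and error models are assumptions chosen for illustration rather than the configuration used for the results.

```python
import math
import random


def anneal(widths, cost_fn, error_fn, max_error,
           t0=10.0, alpha=0.95, steps=2000, seed=0):
    """Search for a cheaper feasible set of widths, starting from a feasible one."""
    rng = random.Random(seed)
    current = list(widths)
    best = list(current)
    temperature = t0
    for _ in range(steps):
        i = rng.randrange(len(current))
        trial = list(current)
        # Unlike the low-effort pass, a word-length may grow as well as shrink.
        trial[i] = max(0, trial[i] + rng.choice((-1, 1)))
        if error_fn(trial) <= max_error:
            delta = cost_fn(trial) - cost_fn(current)
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current = trial
                if cost_fn(current) < cost_fn(best):
                    best = list(current)
        # Geometric cooling, with a floor to avoid dividing by zero.
        temperature = max(temperature * alpha, 1e-12)
    return best


def toy_cost(ws):
    """Hypothetical cost model: total number of fractional bits."""
    return sum(ws)


def toy_error(ws):
    """Hypothetical error model: each variable contributes 2^-width."""
    return sum(2.0 ** -w for w in ws)


start = [12, 12, 12, 12]
best = anneal(start, toy_cost, toy_error, max_error=2.0 ** -6)
print(best, toy_cost(best), toy_error(best))
```

Every step evaluates the error function once, and far more steps are needed than in the greedy pass, which is consistent with the longer run time reported for sa.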
[Figure 4.5 plots: (a) area [LUTs] against precision [bits] for uniform, heuristic and sa; (b) algorithm run time [s], on a logarithmic scale, against precision [bits] for sa and heuristic.]
Figure 4.5: Area and algorithm run time for the B–splines benchmark with variable output
precision (16 bits on one of the outputs). A uniform precision width is not as common
in larger applications because a single variable with a large width results in every other
variable having its width increased. Heuristic algorithms will tend to run more slowly if
a large uniform word-length is chosen because the search space is larger. It is therefore
important to test heuristics on systems that have a large uniform precision, much larger
than each individual word-length. The heuristic generates near-optimal solutions in a
fraction of the time.
Figure 4.5 shows the effect of assigning a different width to certain outputs —
the outputs do not have the same error requirement — to exaggerate the differences
between the algorithms. A single output has had its error fixed to 16 bits. This
results in the uniform word-length being higher than required, commonly occurring in
larger applications because the operator sizes will vary to a greater degree. It makes
the problem more difficult to solve because more of the search space is available. It is
important that any heuristics produce near-optimal solutions under such conditions
given that the goal is to target much larger applications — although the hardware
circuits shown in this chapter are small, software applications tend to have many
more variables. Chapter 5 shows how this approach can be applied to reducing the
power consumption in general purpose processors. If the algorithm were part of a
compiler, run time would be important.
4.1.4 Application to Different Number Systems
This approach can be applied to different number systems to reduce cost, although
the error function may need to change. Several factors must be considered when
selecting different number systems.
The dynamic range required. This is shown more clearly in figure 3.2
(section 3.2.1): the relative and absolute errors vary in different ways depending on
the number system, for example, fixed-point or floating-point. A high dynamic
range often comes at the expense of absolute accuracy (guaranteed in this
chapter) but provides a low relative error.
The number of functions and their relative sizes. A logarithmic number
system can sometimes result in smaller circuits if there are a large number of
multiplications relative to the number of additions [120].
The sensitivity around a given range. Floating-point variables have greater
sensitivity around zero and could thus be more suitable if relative error were
more important than absolute error.
Answers to these questions can be given by profiling the application [50]. As
discussed in chapter 3, floating-point accuracy is assumed. If a low relative error
is required, fixed-point will not perform well because as the range decreases, the
precision remains constant; in the same way, floating-point will not be well suited
to applications requiring a low absolute error if the maximum value stored is large.
Although the algorithm can be applied to any number system (with a modified
error function), simulation is the only way to guarantee that the accuracy is within
tolerances for many applications, such as those generating images. Such approaches
require run-time information.
Many architectures support several different floating-point formats: 32-bit and
64-bit in many general purpose and graphics processors; many also support 16-
bit floating-point. Selecting a reduced precision will reduce power consumption
(chapter 5). One disadvantage of this approach is that it will be very conservative if
a large range is infrequently used. The width of a floating-point variable is fixed,
but its range may change, giving a reduced precision. The worst-case error in this
case is large.
4.2 Run-Time Analysis
Compile-time analysis on its own produces conservative results. To tackle this,
profiling is adopted to extend the analysis, solving three problems:
1. The range of each variable may not be as narrow as it could be.
2. Black-box functions cannot be assigned a precision without knowledge of the
sensitivity of the output to variations on each of the inputs (section 4.2.1).
3. Minimising energy cannot be effectively achieved without knowledge of the
control flow within an application (section 4.2.2).
[Figure 4.6 plot: f(x) against x, for x from 0 to 10, showing sqrt(x) and log(x).]
Figure 4.6: Function approximation for square root and logarithm. The function is divided
into non-uniform segments (typically in excess of 200 [48]). Areas around zero often require
more segments because the sensitivity is often higher.
4.2.1 Black-Box Functions
Software applications often contain library functions, such as square root, to enable
an algorithm to be executed efficiently. Several options exist to create a hardware
circuit with the same functionality:
The first method, table lookup, requires a small area and is often applied if a
function has a limited number of inputs (a fast Fourier transform, for example)
but does not work well if the range of inputs is large.
Function decomposition often simplifies circuit generation, for example, h(x)
could be evaluated as f(g(x)). It is not trivial to automate this approach,
requiring components of the data-flow graph to be matched to known functions.
Function approximation extends function decomposition to arbitrary functions [88],
providing a more scalable approach than table lookup. The domain
of a function is split into segments based on the output sensitivity. Each
segment is then approximated with a polynomial, making error analysis challenging
because errors on the inputs may lead to the wrong segment being
approximated. It is therefore necessary to evaluate the error caused by
approximating the function and the error caused by using a finite input width.
Figure 4.6 shows that the output from a function can fluctuate rapidly as the
inputs approach a given value, in this case, zero. Such functions may require a
large number of coefficients to be calculated.
Koren and Zinaty evaluate elementary functions (for example, logarithm and
exponential) in a coprocessor [81]. This is the most flexible method; however,
it is the least efficient because an entire processor is required. It is best suited
to applications requiring infrequent function evaluation; the processor could
also be utilised by other elements of the system.
Iterative approximation gradually improves a given solution.
These methods often use less hardware but may have a high latency [25].
The disadvantage with these generic approaches is that the functional units may not
be as efficient as manually optimised cores; the advantage is that arbitrary functions
can be realised as circuits.
It is not always clear which algorithm a function in a software application uses.
An approach to analysing the accuracy of such functions is shown which may be
combined with one of the methods above to generate a function for a given hardware
device. Automatic differentiation provides information about error, and subsequently,
word-length. This method is used to calculate variations in sensitivity to input
data changes for any function, beyond primitive operators such as addition and
multiplication. The width of function operands required to guarantee a given error
can be calculated from this information. Consider the function $y = f(x_1, x_2, \ldots, x_n)$.
The sensitivity of $y$, $\Delta y$, can be expressed with a Taylor series as follows.

$$\Delta y \approx \sum_{i=1}^{n} \Delta x_i \frac{\partial y}{\partial x_i}$$

Higher order terms are sometimes omitted if the effect on error is small. Abdul
Gaffar et al. [2] show how error can be propagated using this method. Errors on the
input are calculated by the algorithm used to reduce the word-length. The error can
therefore be propagated forward, resulting in more accurate error calculations for
functions with multiple inputs.
$$y_{error} \approx 2^{-FB(y)} + \sum_{i=1}^{n} x_{i\,error} \frac{\partial y}{\partial x_i}$$

This assumes that the result is truncated. To perform the calculations illustrated
above, the source code being analysed is instrumented with calls to a library.
The approach shown here uses automatic differentiation on any unknown function
where possible. Figure 4.9 shows the effect of word-length optimisation on a ray
tracer, using black-box function analysis to determine the word-length of a square
root core. Given that any function could be used, it is not possible to automatically
calculate how its error changes as its inputs change without input data. One option
to reduce this conservative error estimation is to map a function name to a given
function which will work well if there are a limited number of library functions.

    float a, b;              float a;
    ...                      ...
    if (condition)           if (condition)
        a = a * 2;               a = a * 2;
    else                     else
        b = b * 2;               a = a / 2;

Figure 4.7: An example showing how conditional statements affect error. Left: the error
on both variables increases by a factor of 2. Right: The error on the variable increases by
a factor of 2 despite one path through the program resulting in a decrease in error; the
worst case must be assumed. A run-time analysis can, in some cases, reduce the widths
further by looking at branch probability.
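The propagation formula can be illustrated with a small forward-mode automatic differentiation sketch. The `Dual` class and helper names are hypothetical, the absolute value of each partial derivative is taken (an assumption, so that the contributions cannot cancel), and the derivatives are evaluated at a single input point; a guaranteed bound would need the largest sensitivity over the whole input range.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def dual_sqrt(x):
    root = math.sqrt(x.value)
    return Dual(root, x.deriv / (2.0 * root))


def output_error(f, xs, x_errors, fb_y):
    """First-order error on y = f(xs), truncated to fb_y fractional bits."""
    total = 2.0 ** -fb_y
    for i, err in enumerate(x_errors):
        # Seed the derivative of input i with 1 and all others with 0.
        args = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(xs)]
        total += err * abs(f(*args).deriv)
    return total


def length(x1, x2):
    """Hypothetical black-box function: Euclidean length of (x1, x2)."""
    return dual_sqrt(x1 * x1 + x2 * x2)


err = output_error(length, xs=[3.0, 4.0], x_errors=[2.0 ** -8, 2.0 ** -8], fb_y=8)
print(err)
```

At the point (3, 4) the partial derivatives are 0.6 and 0.8, so each input error is attenuated before it reaches the output.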
4.2.2 Control Flow
The application is partitioned into a set of basic blocks; each block has one entry
point and one exit point. Branches are an important aspect of precision analysis
because they affect the error. Consider the conditional statement in figure 4.7 (left)
in which both a and b could be modified. Given that the analysis is performed at
compile-time and guarantees the accuracy of both a and b, both have an increased
error, as if the conditional statement were not there. In the example on the right, it
is assumed that a takes its error from the true branch of the if condition because
this is the worst case with regards to error.
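The worst-case treatment of branches can be sketched as taking the maximum error over all paths. The function name and the representation of each branch as a scale factor on the incoming error are assumptions made for this illustration.

```python
def merge_branches(error_before, branch_factors):
    """Compile-time error after a conditional: the worst case over every branch."""
    return max(error_before * factor for factor in branch_factors)


e = 2.0 ** -8

# Figure 4.7 (left): a is doubled on one path and untouched on the other,
# and the same holds for b with the paths swapped.
a_left = merge_branches(e, [2.0, 1.0])
b_left = merge_branches(e, [1.0, 2.0])

# Figure 4.7 (right): a is doubled on one path and halved on the other.
a_right = merge_branches(e, [2.0, 0.5])

print(a_left, b_left, a_right)
```

All three results equal twice the incoming error, matching the statement that the path which reduces the error cannot be relied upon at compile time.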
Control-flow analysis can also be used to reduce area and power. If a block of
code is executed frequently, it can have a large influence on the error of the final
result. If a block is not called many times it is likely that the error contributed by
that block will be small, so it may be possible to reduce the precision of the variables
used in this block and others connected to it.
    float a, b, x;           float a, b, x;
    ...                      ...
    if (condition)           if (condition)
        a = a * 2;               a = a / 2;
    ...                      ...
    b = a * x;               b = a * x;

Figure 4.8: An example showing how control-flow analysis can reduce energy consumption
in FPGAs and custom processors. When certain branches are taken, the error decreases.
In this case, it is often possible to reduce the width of the operators, reducing power.
Knowledge of branch probability [118] and how the error is affected by each branch is
important when reducing energy. Left: if condition is false, the error on a and b is
reduced. Right: if condition is true, the error on a and b is reduced.
The number of loop iterations is calculated (at compile time or run time). If this
number were to change, the width of a given variable could change: one width for
each phase [119]. Consider a loop containing loop-carried dependencies. In this case,
the error accumulated could be large if error were to accumulate on each iteration.
If the number of iterations were reduced, the error would decrease. If the error
decreased, the word-length of variables influenced by the loop could be decreased to
maintain the same output error. Another approach is based on branch prediction. If
a branch were unlikely to be taken, the circuit area could be reduced (by removing
the logic for the branch) [118], relying on reconfiguration if the branch were to be
taken (thus slowing the circuit down every time the branch is taken but saving power
and potentially allowing the circuit to operate at a higher clock frequency when the
branch is not taken). In chapter 5, this information is exploited to reduce the energy
of a hardware design.
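The effect of the iteration count on word-length can be sketched for a simple accumulation loop, `acc = acc + x`, using the addition rule from section 4.1. The loop body, the use of one width for both the input and the accumulator, and the function names are assumptions.

```python
def accumulated_error(iterations, frac_bits):
    """Worst-case error on acc after repeating acc = acc + x."""
    x_err = 2.0 ** -frac_bits
    acc_err = 0.0
    for _ in range(iterations):
        # Addition rule: operand errors plus the truncation of the result.
        acc_err = acc_err + x_err + 2.0 ** -frac_bits
    return acc_err


def min_frac_bits(iterations, max_error, limit=64):
    """Fewest fractional bits that keep the accumulated error within max_error."""
    for frac_bits in range(limit + 1):
        if accumulated_error(iterations, frac_bits) <= max_error:
            return frac_bits
    raise ValueError("error constraint cannot be met within the width limit")


print(min_frac_bits(1024, 2.0 ** -8))
print(min_frac_bits(16, 2.0 ** -8))
```

Under this model, reducing the loop from 1024 to 16 iterations removes six fractional bits for the same output error, which is the saving that knowledge of the iteration count makes available.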
Control-flow analysis has applications in custom processor design because the
word-length of operators needs to change based on the programs using them. The
tools can be employed to dynamically change each word-length at run time based
on user-supplied stimuli, such as custom instructions. Figure 4.8 shows an example
of how this approach can reduce the energy required by the device. Consider what
happens if condition is true (left). If the error on a is represented as $a_{error}$ then:

$$(a + a_{error}) \times 2$$

causes the error to be doubled. The width of the variables may therefore need to be
increased to maintain the same accuracy on the output. If condition is false, the
error on a in the worst case is reduced. The example on the right is similar except
that the error is reduced if condition is true. Bits could be removed or switched off
to save power.
Reducing power requires more than one word-length analysis to take place:
one assuming that condition is true and one assuming that condition is false,
emphasising the need for a fast analysis. A similar technique is used when analysing
loops based on the number of iterations. Although area is reduced using the approach
described in this chapter, the cost function could simply be replaced to target power
consumption [1].
4.3 Results
The graphs show the run time2 of several algorithms and the corresponding area:
uniform: Every variable is given the same number of precision bits: the
    minimum number required to satisfy all of the error constraints.
sa: Simulated annealing covers the search space more fully; however, it takes
    significantly longer (section 4.1.3).
error heuristic (err_heuristic in the graphs): The width of each variable is
    calculated based on the error each variable causes, similar to the components
    of [108] in which each word-length is reduced but not increased.
rand heuristic (rand_heuristic in the graphs): A fast heuristic that calculates
    the error of a group of variables, reducing the complexity (section 4.1.1).
heuristic: Heuristic described in section 4.1.1 that partitions the data-flow
    graph, reducing the complexity of the problem.
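The uniform baseline can be sketched as a search for the smallest common number of fraction bits that satisfies an error constraint. The error model below (a sum of values, each rounded to nearest) and the function names are assumptions chosen for illustration; the error function used in this chapter is more general.

```python
def worst_case_sum_error(num_terms: int, fraction_bits: int) -> float:
    """Worst-case error of a sum whose terms are each rounded to nearest.

    Rounding to `fraction_bits` fraction bits introduces at most
    2^-(fraction_bits + 1) per term (assumed model).
    """
    return num_terms * 2.0 ** -(fraction_bits + 1)


def minimum_uniform_precision(num_terms: int, target_bits: int,
                              max_bits: int = 64) -> int:
    """Smallest uniform fraction width with output error <= 2^-target_bits."""
    target_error = 2.0 ** -target_bits
    for fraction_bits in range(max_bits + 1):
        if worst_case_sum_error(num_terms, fraction_bits) <= target_error:
            return fraction_bits
    raise ValueError("no uniform precision satisfies the constraint")


# Hypothetical example: 25 accumulated terms, 8 bits of output precision.
print(minimum_uniform_precision(25, 8))
```

The non-uniform algorithms listed above start from a solution of this kind and then adjust individual word-lengths.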
4.3.1 Ray Tracing
Ray-object intersection is the most computationally intensive part of a ray tracer,
executed millions of times for a single scene. A ray-sphere intersection kernel is
shown in appendix B (figure B.1). FPGAs have an ideal architecture on which
to develop a ray tracer because they can handle the high computational demands
imposed upon them. The smaller the area of this kernel, the greater the number
of cores that can fit on the device and the higher the throughput of the ray tracer.
Word-length optimisation is utilised to reduce the area of each core.
[Figure 4.9 plots, both against Precision [bits] (4 to 20): (a) area [LUTs] for
uniform, heuristic and sa; (b) algorithm run time [s], logarithmic axis, for sa
and heuristic.]
Figure 4.9: Area and algorithm run time for the ray tracer benchmark with variable
output precision. There is only a 5% difference between a design with uniform precision
and one optimised with simulated annealing. For this reason, the uniform precision has
been increased which is more representative of a kernel that is part of a larger application.
Even with this constraint, the heuristic is within 2% of simulated annealing (shown to
be within 1% of the optimal solution for small circuits [87]). Automatic differentiation is
used to calculate the error properties of the square root, required to calculate the output
accuracy.
Footnote 2: All results were collected using an Intel Core2 Duo 3.00GHz processor with 4GB RAM.
Ray tracers take minutes or even hours to complete a scene. For this reason, using
simulation as a method of calculating the optimal selection of variable widths is not
an option. Figure 4.9(a) shows that the area of a ray tracer circuit can be reduced
while guaranteeing accuracy. Due to the high accuracy requirement, the input range
for every variable has been reduced to between 0 and 1 to reduce the area. This
application is used to show how functions can be analysed without the availability of
source code. Although accuracy cannot be guaranteed for arbitrary inputs, it can
be guaranteed for a given test set. The analysis runs in less than 2 seconds and is
unaffected by any unknown functions encountered. This is achieved by annotating
the source code with the worst-case error introduced by the function. This error is
then propagated as it is with any known function. The heuristic algorithm runs over
10 times faster than simulated annealing with an increase in area of less than 2%.
An increase in precision results in a small decrease in the time taken to run the
algorithm because the uniform word-length is closer to the optimal solution, reducing
the size of the search space (figure 4.9(b)).
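The sensitivity calculation for the square root can be sketched as follows. A derivative written by hand stands in for automatic differentiation, the test inputs are hypothetical, and the result is a first-order estimate for the given test set rather than a guarantee for arbitrary inputs.

```python
import math


def sqrt_sensitivity(x: float) -> float:
    """Derivative of sqrt(x): how strongly an input error is amplified."""
    return 1.0 / (2.0 * math.sqrt(x))


def first_order_output_error(test_inputs, input_error: float) -> float:
    """Largest first-order output error estimate over the test set."""
    return max(sqrt_sensitivity(x) * input_error for x in test_inputs)


test_set = [0.25, 0.5, 1.0]    # hypothetical test inputs in the range (0, 1]
input_error = 2.0 ** -12       # assumed worst-case error on the input

estimate = first_order_output_error(test_set, input_error)
print(estimate)
```

The sensitivity grows without bound as the input approaches zero, which is one reason the result only holds for the test set used.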
[Figure 4.10 plots, both against Precision [bits] (2 to 16): (a) area [LUTs] for
uniform, rand_heuristic, err_heuristic, heuristic and sa; (b) algorithm run time [s],
logarithmic axis, for sa, heuristic, err_heuristic and rand_heuristic.]
Figure 4.10: Area and algorithm run time for the convolution benchmark at varying levels
of precision. A 3% area reduction using the heuristic algorithm compared with using the
error heuristic is shown (the low-effort pass in [108]).
4.3.2 Convolution
Figure 4.10 shows the area of a floating-point convolution, containing 25 multipliers
and 24 adders, having reduced the word-length with the algorithms discussed. The
first input array has range 0 to 1 and the second, 0 to 255. The second input will
cause significantly more error because the range is much larger. The optimal solution
will therefore be generated by reducing the precision width of the second input array
by a larger amount. Any reads from the same RAM should have a similar precision to
reduce the width of the RAM. In this case, the width will not change much because
each multiplier connected to the RAM has the same input characteristics (error,
range and sign). In the general case, there is a trade-off between the size of the RAM
and the size of the operators.
The goal of each algorithm discussed is to minimise the area of the circuit while
running in as little time as possible. Simulated annealing runs significantly more
slowly because more of the search space is covered. It may, for example, be more
efficient to increase the width of one variable in order to decrease the width of
another. This solution would not necessarily be explored by the heuristic algorithm
as explained in section 4.1.1. Figure 4.11 shows that the time taken to run the
algorithm can be significantly reduced if the data-flow graph is partitioned, increasing
the area by a small amount. The area increase is less than 2% and the algorithm runs
25 times faster. The data-flow graph is partitioned if there are several uncorrelated
outputs; however, even if the partitioned components were correlated, the time taken
to run the algorithm would be significantly reduced for a small increase in area. The
heuristic gives up to 3% area improvement compared with the error heuristic because
the search space is increased, allowing a greater reduction in word-length.
[Figure 4.11 plots, both against Precision [bits] (2 to 16): (a) area [LUTs] for
uniform, heuristic and sa; (b) algorithm run time [s], logarithmic axis, for sa
and heuristic.]
Figure 4.11: Area and algorithm run time for the convolution benchmark at varying levels
of precision using a partitioned data-flow graph. The algorithm runs over 10 times faster
than the low-effort pass in [108] with a 2% improvement in area.
If partitioning the data-flow graph proves ineffective, another heuristic (discussed
in section 4.1.1), rand heuristic, can be used which runs in approximately 1 second,
producing results within 3% of simulated annealing, as shown in figure 4.10.
Figure 4.12 shows the effect of doubling the number of variables, emphasising
the need for a scalable algorithm. Doubling the number of variables causes the run
time of the heuristic to increase to almost 100 seconds, along with the run time of
heuristics previously proposed [34, 108]; if the data-flow graph is partitioned, the
time taken to run the algorithm decreases to under 2 seconds while the area of the
design is within 2%. This reduction in run time is due to the complexity of the
algorithm decreasing.
4.3.3 Matrix-Vector Multiplication
The majority of applications have multiple outputs, including the 4×4 matrix-vector
multiplier. Although this application could be split into four, it is kept whole here
to illustrate how the algorithm behaves with a larger number of variables. When the
data-flow graph is partitioned, each word-length is assigned to a partition arbitrarily.
The outputs are not assigned the same width to highlight what happens when the
uniform word-length produces a poor result (as explained in section 4.1.3). An
area improvement of 1% is gained by using simulated annealing but it takes over 10
times longer to run. When the data-flow graph is partitioned, the heuristic runs 6
times faster for an area increase of less than 1%. The algorithm runs over 60 times
faster than simulated annealing; the heuristic runs in under a second while simulated
annealing takes over 40 seconds. Additional graphs can be seen in appendix B.
[Figure 4.12 plots, both showing algorithm run time [s], logarithmic axis, against
Precision [bits] (2 to 16): (a) not partitioned, for sa, heuristic, err_heuristic and
rand_heuristic; (b) partitioned, for sa and heuristic.]
Figure 4.12: Algorithm run time for the convolution benchmark with twice the number of
variables at varying levels of precision. If the data-flow graph is partitioned, the algorithm
runs in less than 2 seconds compared to almost 100 seconds without partitioning; the area
is within 2% of simulated annealing which takes over 100 times longer to run.
4.3.4 Analysis
Previous approaches to the problem focus on reducing the area [60, 87] and power [1] of
a circuit while guaranteeing the error is within specified limits. Limited consideration
is given to reducing the complexity of the algorithms. It has been shown [87] that
finding the optimal solution with integer linear programming is not always possible
due to time constraints. For this reason, alternate algorithms, such as simulated
annealing, are employed. Simulated annealing produces near-optimal results, but
it is time-consuming because the search strategy involves randomly selecting a
word-length to increase or reduce. Heuristics have been proposed to reduce the run
time [34, 46, 83, 95, 108]. As circuit size increases, the run time of such algorithms
will increase. Limiting the number of calls to the error function is important because
the error function will be costly if loops with a large iteration space exist. Using a
cost function may ultimately mean more calls to the error function [34].
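The influence of the number of error-function calls can be illustrated with a generic greedy reduction. This is a sketch under stated assumptions and is not the heuristic of section 4.1.1: the error model, the candidates parameter and all values are hypothetical. Restricting the candidate set mimics checking only a subset of the word-lengths.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def greedy_reduce(widths: Sequence[int],
                  error_of: Callable[[List[int]], float],
                  max_error: float,
                  candidates: Optional[Sequence[int]] = None
                  ) -> Tuple[List[int], int]:
    """Remove one bit at a time while the error constraint still holds.

    Returns the final word-lengths and the number of error-function calls.
    """
    current = list(widths)
    indices = list(range(len(current))) if candidates is None else list(candidates)
    calls = 0
    improved = True
    while improved:
        improved = False
        for i in indices:
            if current[i] <= 1:
                continue
            current[i] -= 1              # tentatively remove one bit
            calls += 1
            if error_of(current) <= max_error:
                improved = True          # keep the reduction
            else:
                current[i] += 1          # constraint violated: restore
    return current, calls


def rounding_error(widths: List[int]) -> float:
    """Assumed model: each variable contributes 2^-(width + 1) to the output."""
    return sum(2.0 ** -(w + 1) for w in widths)


full, full_calls = greedy_reduce([8, 8, 8, 8], rounding_error, 2.0 ** -6)
subset, subset_calls = greedy_reduce([8, 8, 8, 8], rounding_error, 2.0 ** -6,
                                     candidates=[0, 1])
print(full, full_calls)
print(subset, subset_calls)
```

In this toy case, checking only two of the four variables needs 6 calls instead of 8 but leaves 29 bits in total rather than 28, a small-scale version of the trade-off between run time and area.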
Table 4.1 shows a comparison of the heuristic algorithm described in this chapter
with simulated annealing [87] and the low-effort part of [108]. It shows that the
heuristic described in this chapter runs significantly faster than simulated annealing
and the error heuristic on larger applications.
Benchmark                  Area      Run time [s]  Run time [s]   Run time [s]
                           increase  (sa) [87]     (error) [108]  (heuristic)
convolution                1%        97            5.3            0.4
convolution (double)       2%        185           42.8           1.6
matrix-vector              1%        19            1.1            0.4
matrix-vector (double)     1%        42            8.3            0.8
Gaussian blur              2%        42            2.8            2.3
ray tracer                 2%        22*           0.2*           0.7
B–splines                  2%        15            0.3            0.3
RGB to YCbCr               0%        10            0.3            0.4
polynomial approximation   2%        18            0.2            0.3
Table 4.1: A comparison of area and run time. * Assuming that the algorithm could
recognise a square root, and in the general case, any library function. The heuristic
algorithm can analyse library functions such as square root even though no source code is
provided (section 4.2).
Doubling the number of variables
(indicated in table 4.1) will not necessarily increase the run time as rapidly as the
other algorithms tested because the size of the partitions can be reduced. For the
applications tested, the increase in area is 2% or less for the heuristic, compared with
simulated annealing, which has been shown to produce results within 1% of the optimal
solution (footnote 3) for circuits with a small number of variables [87], such as
degree-8 polynomial approximation. Although the results shown are for small software
applications of up to 200 variables, the time taken is much greater than software
compilation for similar applications. In some cases, the applications could be
partitioned because the inputs and outputs are independent of each other; however,
this is not done, in order to illustrate how the approach handles designs with a
larger number of variables.
The techniques outlined in this chapter can be extended to reduce power con-
sumption. Algorithms designed to reduce area will not always produce the most
power-efficient hardware [1]. This is because power is dependent on word-level
statistics [12, 15, 57, 58, 72]. Given that the heuristic described in this thesis is
designed to generate solutions at compile time, the circuits will not necessarily be
the most power-efficient.
Footnote 3: Optimised using integer linear programming with ILOG CPLEX.
4.4 Summary
In this chapter, an approach to compile-time word-length optimisation has been
described that rapidly produces near-optimal solutions. It has been shown that
decreasing the width of operators in a data-flow graph can reduce area [116] and
power consumption [30] while guaranteeing specified accuracy constraints. The
problem of selecting the optimal width of each operator is NP–Hard [28]; it is
therefore impractical to find the optimal solution to large problems. Many heuristics
have been proposed [34, 46, 108] based on the impact of reducing a word-length on
the error produced and the cost (area, power consumption etc.) of the final hardware
circuit.
Two approaches have been proposed: those relying on simulation or statistical
models [2, 46] and those that guarantee that the error meets a given specification [60,
87]. Approaches based on simulation require test data that may not always be
available. For this reason, error models are constructed that guarantee that the
error is within given tolerances. The goal is to create a heuristic that produces
near-optimal results while being able to scale to larger problems, without a large
increase in run time. Compared to previous approaches [87], the heuristic algorithm
runs 10 times faster for a convolution design with less than 1% increase in area. To
decrease the algorithm run time further, the data-flow graph is partitioned; instead
of looking at the impact of reducing every word-length, only a subset is checked,
which reduces the complexity of the algorithm. This results in a further 1% increase
in area but reduces the algorithm run time by a factor of 25 for a convolution design.
Profiling is combined with the compile-time analysis to produce less conservative
results and, in some cases, allow it to proceed. If a function uses an unknown
algorithm it is not always possible to assign a word-length that guarantees that a
given error requirement is satisfied. In this case, functions are differentiated [75] to
determine the sensitivity of the output error, given an input error. This enables a
word-length to be assigned that meets a given error constraint on the output, for a
given test set.
Rapid word-length analysis will enable every variable in an application to be
assigned a width. When certain bits of a word are not required to produce results
of sufficient quality, they can be switched off to reduce energy. This is the problem
tackled in the next chapter.
CHAPTER 5
Energy Reduction by Systematic Run-Time
Hardware Deactivation
The high cost required to develop hardware devices has led to the adoption of
reconfigurable technology to enable a circuit to be completely restructured, and
potentially fixed, without fabricating a new one. Although these devices are flexible,
they may not always be as efficient as application-specific integrated circuits (ASICs)
— dynamic power is up to 12 times higher on average for a variety of circuits (9 times
when embedded blocks are used) [84]. Reconfiguration can, however, be exploited to
reduce power consumption because the circuit can be adapted to suit the current
scenario. Two methods have been proposed to achieve this. The first is bitstream
reconfiguration that involves reconfiguring the circuit [73], possibly deactivating part
of it. The second is multiplexer-based reconfiguration, in which parts of the circuit
are selected based on input stimuli [36, 93, 122]. Multiplexing has the advantage
that it can be applied when the circuit cannot be modified.
Designing power-efficient circuits is a challenging task [47]. One method is to
reduce the dynamic power. Clock gating [80, 115] — disabling parts of a circuit
when not in use — has been shown to reduce dynamic power [133], although this
is not always the case [21]. Clock gating is not always possible in FPGAs, so
alternative approaches are investigated. A method of multiplexing the input to
arithmetic operators, which involves feeding constant zero into parts of the circuit, is
combined with word-length optimisation to reduce energy. Bitstream reconfiguration
may be applied to give a similar result, although it has disadvantages. The long
reconfiguration time and high power consumption incurred while reconfiguring the
chip can lead to inefficient hardware devices given that a high reconfiguration
frequency may be required. In order to determine the most efficient approach, the
size of the reconfiguration interval must be known.
To summarise, reconfiguring with multiplexers is fast and consumes little power, but
it results in circuits with a large area and high power consumption during operation. Bitstream
reconfiguration requires more time and power to configure the chip, but it often
provides smaller, more power-efficient circuits during operation. The aim is to show
how these two methods can be combined with word-length optimisation to produce
energy-efficient devices. The innovative elements of the proposed approach are:
1. Two methods, multiplexer-based reconfiguration and bitstream reconfiguration,
are combined with word-length optimisation to develop run-time reconfigurable
circuits (section 5.1).
2. Derivation of the conditions under which multiplexer-based reconfiguration
should be chosen in preference to multiple bitstream reconfiguration (sec-
tion 5.2).
3. Comparison of the two different reconfiguration strategies (section 5.3).
The approach is demonstrated for various case studies, including ray tracing, B–splines,
vector multiplication and inner product (section 5.4).
5.1 Methodology
The approach has three elements:
- Word-length analysis is used to determine where to save power (section 5.1.1).
  This involves locating parts of the design that are not required at a given instant
  so that they can be deactivated. The components are either separated into
  different bitstreams (bitstream reconfiguration), or multiplexed (section 5.1.2)
  and deactivated when not required (section 5.3.1).
- A model to determine when to use the different strategies to reduce energy
  (section 5.2).
- A reconfiguration strategy to determine how to save power (section 5.3).
5.1.1 Word-Length Selection
Word-length optimisation involves reducing the width of variables such that the
area and power consumption of a circuit can be minimised. This, combined with
a reconfiguration strategy, can result in an energy reduction, achieved by various
approaches; below, one is shown based on combining bitstream reconfiguration and
multiplexing to adapt the circuit at run time. The precision of a variable can be
modified at run time to improve performance and reduce energy. The reduction in
precision is calculated based on the error constraints, typically on the output.
Every variable involved in arithmetic operations has an associated range and
precision. The range and precision reduction is accuracy-guaranteed [87] which
means that any results have a given accuracy, irrespective of the input data. Since
these results will be conservative, run-time analysis can be used so that the results
are guaranteed for a specific set of input data. Arithmetic is performed on the error
associated with each variable such that the worst-case error on the output can be
calculated given errors on the inputs (discussed in more detail in chapter 4).
In this chapter, the problem of selecting the optimal width for each operand is
generalised. Consider a set of variables, wl, where wl_i is the word-length of variable
i. One of the goals of data representation optimisation is to select wl such that the
cost function is minimised. Here, this definition is extended by defining multiple
sets: the word-length of variable i is now defined as wl_{j,i}, where wl_j is one possible
configuration. Multiple configurations are selected at different accuracy levels because
the circuit can now change at run time.
[Figure 5.1 block diagrams: (a) generic reconfiguration model, in which Input data
passes through Fork to components C0 and C1 and through Join to Output, under the
control of Config-Select; (b) clock-gated architecture, in which Config-Select drives
the clock enables (CE) of C0 and C1 and selects which component feeds Output.]
Figure 5.1: A model of a reconfigurable circuit, configured with Config-Select (dashed
lines in the model represent the flow of control data). The clock-gated architecture is also
shown.
If the accuracy of a variable is reduced, the output error will not always be
affected. Loops and conditional statements can have a large effect on the error; if the
conditions of these branches were to change, the error could change (section 4.2.2). In
principle it is possible for the width of an operand to increase from one configuration
to the next even though the accuracy of an output has decreased. This makes manual
circuit design difficult because the difference in width from one configuration to the
next will not be constant for every variable. The following sections discuss different
methods of reconfiguring the device based on these analyses.
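The multiple-configuration notation can be sketched as a small table of word-lengths. All values and names below are made up for illustration; in particular, the assumption that a multiplexed design keeps the maximum width of each variable on chip is a simplification.

```python
# wl[j][i]: word-length of variable i in configuration j (values are made up).
wl = {
    "full_accuracy": [24, 20, 18],
    "reduced_accuracy": [16, 21, 12],
}


def width_changes(source: str, target: str) -> list:
    """Change in width of every variable when moving between configurations."""
    return [t - s for s, t in zip(wl[source], wl[target])]


def resident_widths() -> list:
    """Width each variable needs if every configuration stays on chip."""
    return [max(widths) for widths in zip(*wl.values())]


changes = width_changes("full_accuracy", "reduced_accuracy")
resident = resident_widths()
print(changes)
print(resident)
```

In this hypothetical example the second variable becomes one bit wider in the less accurate configuration while the others shrink by different amounts, so the change in width is not constant across variables.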
5.1.2 Reconfiguration with Multiplexers
Modelling and developing reconfigurable circuits [36, 93] requires specialist development
tools [94] to make efficient use of the resources available. These models describe
a circuit as a set of interchangeable components, C0 and C1 in
figure 5.1(a). In the general case, there could be a greater number of components,
C0, ..., C(n-1), where n is the number of components; however, the more components there
are, the more complex the routing. Multiplexers provide a way of rapidly (cycles
as opposed to milliseconds) reconfiguring the circuit, allowing data to be moved to
the active part via the routing blocks (Fork and Join). This methodology assumes
that every component can reside on the chip at a given time. An abstract model has
also been proposed [112] in which the routing blocks are virtual. It may be the case
that the multiplexed components do not reside on the same chip but are separate
systems.
[Figure 5.2 block diagrams: (a) ASIC, in which Control Logic gates clk before it
reaches the Logic Element(s) between Input and Output; (b) FPGA, in which Control
Logic drives the clock enable (ce) of the Logic Element(s) while clk is connected
directly.]
Figure 5.2: Clock gating in an ASIC and FPGA (assuming no dedicated hardware exists
in the FPGA). The FPGA architecture is less efficient since the clock input is still toggling
when inactive.
The model is now extended to reduce energy in reconfigurable devices and
ASICs. Since only one of the regions will be active at any one time, the clock to
the inactive regions can be stopped, as shown in figure 5.1(b). A reconfiguration
controller determines when to switch between different configurations, activating and
deactivating different parts of the circuit as required. If it is not possible to stop the
clock to these regions (as explained in section 5.3.1), the input is set to constant zero
to reduce signal transitions.
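The effect of feeding constant zero into inactive bits can be sketched by counting signal transitions on an input bus. The counter stimulus, the choice of two inactive low-order bits and the function names are assumptions for illustration; toggle counts are only a proxy for dynamic power.

```python
def bit_toggles(values) -> int:
    """Number of bit transitions between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def zero_low_bits(values, inactive_bits: int):
    """Force the `inactive_bits` least significant bits to constant zero."""
    mask = ~((1 << inactive_bits) - 1)
    return [v & mask for v in values]


samples = list(range(16))                 # a 4-bit counter as input stimulus
gated = zero_low_bits(samples, 2)         # two low-order bits deactivated

print(bit_toggles(samples), bit_toggles(gated))
```

For this stimulus the number of transitions drops from 26 to 4, because the most frequently toggling bits are the ones held at zero.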
If large regions of components C0 and C1 in figure 5.1 are identical, there are two
options. The first option is to treat the components independently. Although a high
performance may be achievable, a large area is often required because components
that could be shared are duplicated. This option is often used when reconfiguring the
bitstream. The second option is to share parts of the hardware and multiplex parts
of C0 and C1; this may only require multiplexing a single register. Some examples are
presented in section 5.4.1, comparing the multiplexer-based approach with multiple
bitstream reconfiguration.
5.1.3 Reducing Power Consumption in FPGAs
It is important to deactivate those components that are not in use to minimise energy.
Clock gating is a technique employed in circuits to reduce the dynamic power of
inactive components. This is accomplished by disabling the clock at a given point,
thus reducing signal transitions in a hardware component when it is not required to
be active. Figure 5.2 shows two different methods of achieving the effect of clock
gating. When the clock is gated in an ASIC, the clock tree will be switched off. In an
FPGA the clock to the gated component will still toggle, reducing the power saving.
In both cases the flip-flops in the gated logic will not toggle. Due to parts of the
FPGA having a fixed architecture, nothing can be placed between the clock tree and
the logic to be gated. Dedicated clock buffers do exist; however, these control a large
region of the clock tree and are limited in number (16 on a Virtex-II Pro XC2VP30).
Without a clock buffer, the clock is still toggling and therefore consuming power. In
the case that the flip-flop does not have a clock enable, a multiplexer can be used:
the clock enable acts as the control, one of the inputs allows new data to be loaded,
and the output is fed into the second input, resulting in an unchanged output while
the clock enable is zero.
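The behaviour of a flip-flop whose clock enable is emulated with a multiplexer can be sketched as a small behavioural model. The class and method names are hypothetical and the model ignores timing; it only shows that the output is fed back while the clock enable is zero.

```python
class MuxEnabledRegister:
    """Register without a clock enable, preceded by a 2:1 multiplexer."""

    def __init__(self, initial: int = 0) -> None:
        self.q = initial

    def clock(self, d: int, ce: int) -> int:
        """One rising clock edge.

        The multiplexer selects the new data d when ce is 1 and feeds the
        current output q back when ce is 0, so q is unchanged.
        """
        self.q = d if ce else self.q
        return self.q


reg = MuxEnabledRegister()
stimulus = [(5, 1), (7, 0), (9, 0), (3, 1)]     # (data, clock enable) pairs
outputs = [reg.clock(d, ce) for d, ce in stimulus]
print(outputs)
```

The register is still clocked on every cycle in this arrangement, which is consistent with the clock continuing to consume power.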
5.1.4 Combining Reconfiguration Approaches
Several steps must be completed in order to combine different methods of reconfig-
uration — bitstream reconfiguration and multiplexer-based reconfiguration. The
methodology can be summarised in the following steps.
1. The circuit is divided into blocks, one for each phase. Word-length analysis
is one method of partitioning the design; different configurations are created
based on the accuracy required.
2. The design is refactored to avoid duplicating components, minimising area
and reconfiguration overheads — weighted bipartite graphs [112] have been
used to solve this problem. The use of bitstream reconfiguration allows each
configuration to be efficiently optimised independently as well as combining
multiple configurations into a single system. When multiplexing components,
it is likely that optimising a group of configurations together will prove more
effective.
3. New architectures which simplify chip design [7] (in which the placement of
different configurations is not restricted to identical regions of the FPGA) and
online routing techniques [105] can be incorporated (whereby routing logic can
be altered based on run-time conditions and requirements) if partial bitstream
reconfiguration is available.
4. A controller determines when to switch between different configurations. If
there are a large number of configurations, multiplexers can be replaced by
dedicated decoders, as is the case in a general purpose processor, to reduce
the area of the controller. Calculating the optimal number of controllers to
minimise energy is an important aspect of the problem, particularly in large
systems.
In the general case, power-efficient designs can be produced by clock gating the
deactivated parts or by using multiple bitstreams in which the deactivated parts have
been eliminated. In the next section the conditions under which each reconfiguration
strategy should be employed are discussed. The constraints are outlined in section 5.3
with an analysis showing when each strategy should be selected for a given device.
5.2 Reconfiguration Conditions
In many applications, such as ray tracing and feature extraction, the algorithm’s
control-flow is dependent on the input. This means that the functionality of the
system will change as the input changes, so static word-length optimisation will have
a reduced effect. As the system functionality changes, the operator widths should be
allowed to change. Based on stimuli, which may come from outside the system or be
generated by the system, the word-lengths will adapt in such a way as to reduce the
power consumption of the system while keeping the error to a minimum.
The multiplexer-based approach often requires a large amount of power while
running a task because the entire design resides on the chip, regardless of whether
it is active at a particular time. Bitstream reconfiguration can be used to produce
efficient systems but they take longer to adapt to changing conditions and incur
a temporary increase in power consumption while reconfiguring; for this reason a
model to determine the most efficient reconfiguration strategy is used.
5.2.1 Performance
As well as increasing the throughput of a system, it is also important to minimise the
energy required, especially if such systems are mobile. Each method of reconfiguration
has advantages and disadvantages. To determine the most efficient reconfiguration
strategy at a given instant, the characteristics of the application (the reconfiguration
time and associated power required, along with the run time and power required
between reconfigurations) are analysed over a period of time.
Notation         Description
t_eb and t_em    Time taken to complete part of the task between two
                 reconfigurations (bitstream and multiplexer-based respectively).
t_b and t_m      Time spent reconfiguring the system, either by modifying the
                 bitstream or multiplexing components.
p_eb and p_em    Power required to complete part of the task when bitstream
                 reconfiguration and multiplexer-based reconfiguration are
                 employed respectively.
p_b and p_m      Power required to reconfigure the system, either by modifying
                 the bitstream or reconfiguring the multiplexers.
Table 5.1: A summary of the notation used in the reconfiguration model.
The notation used in this model is shown in table 5.1. The total run time when
employing reconfigurable multiplexing and multiple bitstream reconfiguration is
obtained by summing the respective elements, the time taken to complete part of
a task and the reconfiguration time. It is assumed that the reconfigurations are
uniformly distributed over the period of time analysed. If they were not, the different
approaches could be combined to obtain a more efficient solution.
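The summation can be sketched for a number of uniformly distributed reconfiguration intervals. Treating energy as power multiplied by time for each element is an assumption made here for illustration, as are the class name and every numerical value; the sketch does not reproduce the conditions derived later in this chapter.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    task_time: float       # t_eb or t_em: time between two reconfigurations [s]
    reconfig_time: float   # t_b or t_m: time spent reconfiguring [s]
    task_power: float      # p_eb or p_em: power while running the task [W]
    reconfig_power: float  # p_b or p_m: power while reconfiguring [W]

    def total_time(self, intervals: int) -> float:
        return intervals * (self.task_time + self.reconfig_time)

    def total_energy(self, intervals: int) -> float:
        return intervals * (self.task_power * self.task_time
                            + self.reconfig_power * self.reconfig_time)


def preferred(bitstream: Strategy, multiplexed: Strategy, intervals: int) -> str:
    """Name of the strategy with the lower total energy."""
    if multiplexed.total_energy(intervals) < bitstream.total_energy(intervals):
        return "multiplexer"
    return "bitstream"


# Hypothetical values: frequent reconfiguration (short intervals).
short_choice = preferred(Strategy(0.010, 0.020, 1.0, 1.5),
                         Strategy(0.012, 1e-8, 1.3, 1.3), intervals=100)
# Hypothetical values: infrequent reconfiguration (long intervals).
long_choice = preferred(Strategy(1.0, 0.020, 1.0, 1.5),
                        Strategy(1.2, 1e-8, 1.3, 1.3), intervals=100)
print(short_choice, long_choice)
```

With these made-up numbers the multiplexed design is preferable when reconfiguration is frequent, and bitstream reconfiguration is preferable when the interval between reconfigurations is long.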
Multiplexers can be reconfigured in as little as a single cycle, providing a way of
rapidly modifying a circuit, although it may take time for results to be generated
depending on the latency of the system; bitstream reconfiguration takes many times
longer [53], hence t_m < t_b. The time to complete a task, however, tends to be
greater when multiplexers are adopted to reconfigure the system, t_em > t_eb, because
it is often the case that an alternative bitstream configuration exhibits an improved
performance. A design employing bitstream reconfiguration is likely to have a higher
clock frequency because it has a smaller area. Place and route tools are therefore
more likely to be able to pack elements close together. In some cases, the clock
frequency of a multiplexed design can be increased by using a different clock frequency
for each component (figure 5.1) — the number of clock domains is increased; this
will, however, only be applicable if a coarse-grain approach is adopted.
Figure 5.3(b) shows the power reduction of a B–splines benchmark with simulated
inputs: 8-bit counters combined to form larger variables, providing a uniform toggling
input over the length of the variable. Each arithmetic operator calculating the same
function has the same number of pipeline stages. The effect of the pipeline structure
is not investigated here so the number of registers may need to change depending on
the most important cost metric, area or power consumption.
[Figure 5.3 plots, both against Precision [bits] (5 to 30): (a) area [LUTs/Flip-Flops]
for flip-flops and luts; (b) power consumption [mW] for multiplexed-max,
multiplexed-reduced and bitstream reconfig, with the gaps between the curves
annotated as Power improvement and Power loss.]
Figure 5.3: Area and power consumption of the B–splines benchmark on the Xilinx
XC2VP30 with varying output accuracy. Two designs are shown in figure 5.3(b). The first,
bitstream reconfig, contains no on-chip logic to reconfigure the circuit. The second design
is capable of running at two different accuracy levels, multiplexed-max and
multiplexed-reduced. The power loss is a result of the extra circuitry required to run at
full precision. Bitstream reconfiguration results in the smallest, most power-efficient
design; however, a multiplexed design is quick to reconfigure.
The graph shows a comparison of two designs. The first, bitstream reconfig, is
one which has been optimised by reducing the width of operators, for example,
multipliers, in such a way that it satisfies the error requirement given on the x-axis.
A precision of $x$ bits means that the error on the output is less than or equal to $2^{-x}$.
Figure 5.3(a) shows the corresponding area.
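The precision convention can be illustrated with a minimal C sketch that truncates a non-negative value to a given number of fractional bits; the discarded part is then non-negative and below the stated bound. The helper names `error_bound` and `truncate_to_bits`, and the use of `double` as a reference type, are assumptions made for illustration and are not part of the measured designs.

```c
#include <assert.h>
#include <stdint.h>

/* Error bound for a precision of `bits` bits: 2^-bits. */
double error_bound(unsigned bits)
{
    assert(bits <= 30);
    return 1.0 / (double)(UINT32_C(1) << bits);
}

/* Hypothetical helper: keep `bits` fractional bits of a non-negative value
 * by truncation. The discarded part lies in [0, 2^-bits). */
double truncate_to_bits(double value, unsigned bits)
{
    assert(bits <= 30 && value >= 0.0 && value < 1.0e9);
    double scale = (double)(UINT32_C(1) << bits);      /* 2^bits */
    return (double)(uint64_t)(value * scale) / scale;  /* cast truncates */
}
```

For example, truncating 0.7 to 8 fractional bits gives 179/256, an error of roughly 0.0008, which is within the bound of 1/256.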
This is compared with the second design, capable of running at two precisions:
the precision given on the x-axis, multiplexed-reduced, and 32 bits of accuracy,
multiplexed-max. The first design is one that would be employed if the bitstream
were reconfigured to the optimal circuit, the second is one in which the entire circuit
remains on chip and part of it is deactivated to achieve a reduced precision. Since
the entire circuit, including the inactive part, remains on-chip, it is larger than
one developed with the minimum required logic to complete the task at any given
instant, resulting from bitstream reconfiguration; hence, the multiplexer-based design
consumes more power while performing a task at a given precision. This difference
is shown in figure 5.3 as Power loss. As the maximum precision is approached, the
102  Chapter 5: Energy Reduction by Systematic Run-Time Hardware Deactivation
fluctuation caused by the place and route tools and chip temperature [6] becomes so
great that the difference in power consumption between the two circuits is not as
pronounced. The design employing a single precision will consume less power than
the design capable of selecting several precisions. This is because the active logic is
performing the same task, and the inactive logic in the larger design is consuming
power, largely due to the clock tree still toggling.
All designs operate with the same number of pipeline stages. In some cases it
may be preferable to reduce the number of pipeline stages for designs employing
bitstream reconfiguration to reduce the area. This will affect the reduction in power
caused by reconfiguring the design as opposed to keeping the entire design on-chip.
Overheads such as power dissipation of the controller required to select a precision
are not included. Any Block RAM will not have its precision reduced because all
values must be stored at their maximum precision; there will therefore be a smaller
power improvement.
The energy used by the multiplexer approach, which includes task execution and
any reconfiguration that occurs, is:
$$\sum (t_{em} \times p_{em}) + \sum (t_m \times p_m)$$
The energy of a reconfigurable design with multiple bitstreams is calculated in the
same way. A summation is used to show that there may be multiple reconfigurations.
As stated above, it is assumed that the reconfigurations are uniformly distributed.
If they were not, the different approaches to reconfiguration could be combined to
achieve a more efficient solution. Reconfiguring the multiplexers is usually fast (often
measured in cycles) and the power required is low, so the reconfiguration energy
is often ignored. Multiplexers will provide a more energy-efficient solution than
bitstream reconfiguration when:
$$(t_{em} \times \delta p) + (t_{mb} \times p_{eb}) < (t_b \times p_b)$$
where $t_{mb} = t_{em} - t_{eb}$, the difference in run time between the two approaches, and
$\delta p = p_{em} - p_{eb}$; $\delta p$ can be obtained from figure 5.3, Power loss.
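The inequality above can be evaluated directly once the six quantities are known. The following C sketch is one way of doing so; the structure `reconfig_model`, its field comments (one reading of the symbols in the text) and the function name are assumptions introduced for illustration, and the multiplexer reconfiguration energy is ignored as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Times in seconds, powers in watts. Field names mirror the symbols
 * in the text; the comments are an interpretation, not a definition. */
struct reconfig_model {
    double t_em;  /* execution time, multiplexed design */
    double p_em;  /* execution power, multiplexed design */
    double t_eb;  /* execution time, bitstream-reconfigured design */
    double p_eb;  /* execution power, bitstream-reconfigured design */
    double t_b;   /* bitstream reconfiguration time */
    double p_b;   /* bitstream reconfiguration power */
};

/* True when (t_em * dp) + (t_mb * p_eb) < (t_b * p_b). */
bool mux_more_efficient(const struct reconfig_model *m)
{
    assert(m->t_em >= 0.0 && m->t_eb >= 0.0 && m->t_b >= 0.0);
    double delta_p = m->p_em - m->p_eb;  /* power loss */
    double t_mb = m->t_em - m->t_eb;     /* run-time difference */
    return (m->t_em * delta_p) + (t_mb * m->p_eb) < (m->t_b * m->p_b);
}
```

With a hypothetical power loss of 200 mW and equal run times, an interval of 0.05 s favours multiplexing against a 21 mJ reconfiguration, whereas an interval of 0.5 s does not.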
In this case, both circuits operate at the same frequency. The average time,
$t_e$, before one method of reconfiguration becomes more efficient than another can
be calculated. The more measurements (or estimates) that are taken, the more
5.2 Reconfiguration Conditions  103
 0.98
 1
 1.02
 1.04
 1.06
 1.08
 1.1
 0.15  0.2  0.25  0.3  0.35  0.4  0.45
R
u n
 T
i m
e  
[ s ]
Reconfiguration Time [s]
multiplexed
bitstream reconfig
bitstream reconfig incf.
(a) run time
 200
 210
 220
 230
 240
 250
 260
 270
 280
 290
 300
 0.15  0.2  0.25  0.3  0.35  0.4  0.45
E n
e r
g y
 [ m
J ]
Reconfiguration Time [s]
multiplexed
bitstream reconfig
(b) energy
Figure 5.4: Variation in run time and energy for an inner-product design with a word-
length of 20 bits. If both designs must finish at the same time, the frequency of the
design employing bitstream reconfiguration must be increased due to the reconfiguration
overhead. The model is used to determine the scenarios under which bitstream reconfig-
uration is more efficient than multiplexing given that both designs must finish at a given
time (reconfiguration time and power are estimated as explained in section 5.3.3).
accurate the model is likely to be. Phase analysis can be used to determine when a
reconfiguration should occur if it is not clear [69]. In this model:
$$t_e < \frac{t_b \times p_b}{\delta p} \qquad (5.1)$$
for designs employing multiplexers to be more efficient (illustrated in figure 5.6).
A reconfiguration schedule can be produced based on how $t_e$ varies over time [17].
From figure 5.3, given $t_e$, which depends on the application, the average precision
can be calculated to suit a given method of reconfiguration.
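Equation (5.1) gives a break-even interval that is straightforward to compute. The sketch below assumes times in seconds and powers in watts; the function names are hypothetical, and the example values combine the reconfiguration estimates quoted in section 5.3.3 (14 ms, 1500 mW) with an assumed power loss of 200 mW.

```c
#include <assert.h>
#include <stdbool.h>

/* Right-hand side of equation (5.1): (t_b * p_b) / delta_p. */
double break_even_interval(double t_b, double p_b, double delta_p)
{
    assert(t_b >= 0.0 && p_b >= 0.0 && delta_p > 0.0);
    return (t_b * p_b) / delta_p;
}

/* Multiplexing is the more efficient choice while t_e stays below
 * the break-even interval. */
bool prefer_multiplexing(double t_e, double t_b, double p_b, double delta_p)
{
    return t_e < break_even_interval(t_b, p_b, delta_p);
}
```

Under these assumed values the break-even interval is 0.021 J / 0.2 W = 0.105 s; a smaller power loss lengthens the interval over which multiplexing remains preferable.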
5.2.2 Reconfiguration Interval
The run time is dependent on the application; specifically, the frequency and distri-
bution of reconfiguration. If the circuit is reconfigured frequently, the reconfiguration
time will have a more pronounced effect on the total run time. Since multiplexer-
based reconfiguration tends to take less time, it is likely that it will be more efficient
than bitstream reconfiguration in this case even though the circuit may not be as
efficient at a given instant. If the circuit is reconfigured many times over one period
of time and fewer times over another period of time (of the same length), both
methods of reconfiguration can be adopted to produce an efficient system; hence the
need to look at the distribution of reconfigurations.
The reconfiguration interval, that is, the length of time between reconfigurations
occurring, affects power consumption and the time to complete a task. Although the
intervals have a certain distribution, for simplicity, it is assumed that the intervals are
uniformly distributed. Since multiple bitstream reconfigurable designs do not contain
as much logic on-chip, they can sometimes run more quickly. If the designs could
run at different frequencies ($x$ MHz and $y$ MHz, where $x < y$), the energy condition
for multiple bitstream reconfiguration to be more efficient would be:
$$(p_b \times t_b) + (\delta rp \times t'_{eb}) < (\delta p \times t_{em}) + (t_b \times p_{eb})$$
where $\delta rp$ is the difference in power consumption at the different frequencies and $t'_{eb}$
is the time between reconfigurations at the higher frequency. If both designs must
complete at the same time:
$$t'_{eb} = t_{em} - t_b$$
$$y = \frac{x \times t_{em}}{t'_{eb}}$$
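The two relations above determine the clock frequency that the bitstream-reconfigured design needs in order to finish at the same time as the multiplexed design. A minimal C sketch follows; the function names are hypothetical and the units (seconds, MHz) are assumptions.

```c
#include <assert.h>

/* t'_eb = t_em - t_b: execution time left once the bitstream
 * reconfiguration time has been paid for. */
double execution_window(double t_em, double t_b)
{
    assert(t_b >= 0.0 && t_em > t_b);
    return t_em - t_b;
}

/* y = x * t_em / t'_eb: frequency needed to finish at the same time. */
double required_frequency_mhz(double x_mhz, double t_em, double t_b)
{
    return x_mhz * t_em / execution_window(t_em, t_b);
}
```

As an illustration with assumed values, a design running at 100 MHz with a 1 s interval and a 0.2 s reconfiguration would need 125 MHz; the required frequency grows without bound as the reconfiguration time approaches the interval.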
Figure 5.4 shows how the reconfiguration interval affects the run time and energy
requirements. The graphs show that a reconfiguration interval of more than 0.35
seconds means that a design employing bitstream reconfiguration will be more efficient
than using multiplexers to reconfigure the device. To estimate power consumption at
an arbitrary clock frequency, power consumption is measured at different frequencies
and interpolation used by separating the active and inactive components of the power
consumption.
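One simple way to realise this interpolation is sketched below in C. It assumes that the active component scales linearly with clock frequency and that the inactive component is independent of it; this linear model, the two-point fit and all names are assumptions for illustration rather than the procedure used to produce figure 5.4.

```c
#include <assert.h>

struct power_fit {
    double inactive_mw;        /* frequency-independent component */
    double active_mw_per_mhz;  /* component that scales with frequency */
};

/* Fit P(f) = inactive + active * f through two measurements. */
struct power_fit fit_power(double f1_mhz, double p1_mw,
                           double f2_mhz, double p2_mw)
{
    assert(f1_mhz != f2_mhz);
    struct power_fit fit;
    fit.active_mw_per_mhz = (p2_mw - p1_mw) / (f2_mhz - f1_mhz);
    fit.inactive_mw = p1_mw - fit.active_mw_per_mhz * f1_mhz;
    return fit;
}

/* Estimate the power at an arbitrary clock frequency. */
double estimate_power_mw(const struct power_fit *fit, double f_mhz)
{
    return fit->inactive_mw + fit->active_mw_per_mhz * f_mhz;
}
```

With hypothetical measurements of 400 mW at 50 MHz and 700 mW at 100 MHz, the fit gives 100 mW of inactive power and 6 mW per MHz, hence an estimate of 850 mW at 125 MHz. More measurements and a least-squares fit would reduce sensitivity to measurement noise.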
If this method is used alone, the energy will not increase significantly by increasing
the clock frequency; the power will increase but the run time will decrease. In practice,
it may increase because additional hardware may be required.
5.3 Reconfiguration Strategy
In order to allow a circuit to adapt to different input conditions, it can be reconfigured
— either the bitstream can be modified or different components can be multiplexed.
Each variable is allowed to have a different accuracy, calculated based on the accuracy
requirement and altered based on the reconfiguration strategy. If a lower accuracy is
required, the design can be reconfigured. Using a lower accuracy can have the same
effect as reducing the bandwidth, accelerating the application.
5.3 Reconfiguration Strategy  105
5.3.1 Multiplexer Reconfiguration
Component multiplexing [36, 93, 122] is fast (cycles as opposed to milliseconds);
however, the major disadvantage is the power consumed by the inactive components.
One option to reduce power consumption is clock gating. Clock gating provides a
means of reducing switching activity by disabling registers from reading external
data; dynamic power consumption is therefore reduced. Clock gating is supported
by current FPGA devices through the use of dedicated on-chip resources. The Xilinx
Virtex series of devices contain a clock gating block, BUFGCE, which provides a
way of turning a global clock tree on or off [80], resulting in power savings both from
the clock tree and the components attached to it. However, the number of such
clock gating blocks may be limited — there are only 16 BUFGCE elements on a
Xilinx XC2VP30 FPGA and fewer on smaller devices such as the Xilinx XC3S500E.
If more are required, the clock-enable input can be used to gate the register in each
reconfigurable logic element. This method can be used to support reconfigurable
word-length optimisation; however, to simplify the designs and reduce the area, the
inputs are multiplexed.
It may be possible to group several components together so that they share a
single clock gating control, such that the number of components matches the number
of specialised clock gating blocks on a given device. Due to the number of functional
units that are used, this approach is not adopted. A coarse-grain approach using the
clock-enable input on each core could also be employed.
Zhang et al. [133] analyse the effect clock gating has on dynamic power consump-
tion, showing that FPGAs, although not as efficient as ASICs, can achieve significant
power reductions. The authors use power estimation tools that employ event-based
gate-level simulations because they are more accurate than high-level simulations [57]
and faster than switch-level simulations [86]. Dynamic power reductions of between
45.1% and 87.1% are shown. Benini et al. [10] present a technique for automatic-
ally synthesising logic for gated clocks. When their technique was applied to the
MCNC [132] benchmark suite, the average power reduction was estimated to be 25%
with a 5% increase in area. Both of these techniques show that large power savings
may be possible by gating the clock, although power consumption is estimated. It is
not always possible to save power using gated clocks. Cadenas et al. [21] use a clock
gating technique in a pipelined CORDIC core with the goal of reducing bit-switching
but do not achieve an improvement.
As mentioned in section 5.1, word-length optimisation is used to determine where
to save power. Modifying the width of operators is a common method of reducing
power consumption of hardware designs. Brooks and Martonosi [18] show that often,
the full width of an arithmetic operator is not required. Rather than simply removing
part of an operator, clock gating is employed to reduce the power consumption when
not in use, resulting in power savings of between 45% and 60% for the SPECint95 and
MediaBench benchmark suites. The authors also suggest that operators be packed
into single units to increase performance. A similar method is applied to dynamically
reduce the width of subtraction operators [101]. This strategy applies to the most
significant bits of an operator. Reducing the precision requires a complex analysis
(chapter 4) and in some cases can lead to better results because the least significant
bits of a variable tend to switch more rapidly.
Given that word-length optimisation requires more dedicated clock gating elements
than are available, methods of producing a similar effect are investigated. One
approach is to connect several smaller cores together, for example, multipliers, to
make a larger one. Some are then disabled using the clock enable port to reduce power
when the accuracy of the full core is not required. This method can be inefficient
because one large core can be optimised more fully than several independent cores
connected together. Since only the clock enable port on flip-flops is utilised to save
power, the clock input is still toggling; more power is therefore consumed than in an
ASIC (in which the clock can be completely gated). Another option is to use a more
global clock gating technique whereby entire arithmetic operators are selected, for
example, multipliers with different precisions, instead of gating individual bits; this
is not done here because it requires a large area.
To reduce the area overhead of multiplexing components, the lower-order bits of
the input are set to zero to minimise signal transitions in the circuit. This produces
a similar effect to clock gating the unwanted bits. Glitches — unwanted signal
transitions as a result of the system having not reached a stable state — are a key
contribution to power loss in electronic systems. For this reason, methods of reducing
them have been researched [85]. As a consequence of reducing signal transitions on
the least significant bits, glitches are minimised, reducing power.
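The effect of zeroing the lower-order bits on input switching activity can be modelled in software. The C sketch below masks the unwanted bits of a 64-bit input and counts how many bit positions change between consecutive inputs; the function names are hypothetical, and the count covers input transitions only, not the glitches generated inside the operator.

```c
#include <assert.h>
#include <stdint.h>

/* Force the `dropped` least significant bits of an input to zero. */
uint64_t zero_low_bits(uint64_t value, unsigned dropped)
{
    assert(dropped <= 64);
    if (dropped == 64)
        return 0;  /* a shift by 64 would be undefined */
    return value & ~((UINT64_C(1) << dropped) - 1);
}

/* Number of bit positions that change between consecutive inputs. */
unsigned count_toggles(uint64_t previous, uint64_t current)
{
    uint64_t diff = previous ^ current;
    unsigned toggles = 0;
    while (diff != 0) {
        diff &= diff - 1;  /* clear the lowest set bit */
        toggles++;
    }
    return toggles;
}
```

For the input pair 0x0F and 0xF0, eight bits toggle at full precision; with the four least significant bits zeroed, only four do.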
Figure 5.5 shows the power consumption of a 64-bit multiplier running at 100MHz
with different precisions, changed by setting the unwanted bits to zero. There is a 1%
area overhead because the inputs must be multiplexed. This overhead can increase
5.3 Reconfiguration Strategy  107
 0
 100
 200
 300
 400
 500
 600
 700
 800
 900
 1000
 10  20  30  40  50  60
P o
w
e r
 [ m
W
]
Precision [bits]
multiplexed-max
multiplexed-reduced
bitstream reconfig
(a) soft logic multiplier
 0
 50
 100
 150
 200
 250
 300
 350
 400
 450
 500
 10  20  30  40  50  60
P o
w
e r
 [ m
W
]
Precision [bits]
multiplexed-max
multiplexed-reduced
bitstream reconfig
(b) embedded multiplier
Figure 5.5: Power saving for a 64-bit multiplier (constructed using VHDL and Core-
gen 10.1) by reducing the precision. As the width of the multiplier increases, the number
of toggling bits increases and the power increases. Additional logic is included to multiplex
between input widths.
in larger designs as routing becomes more complex. Figure 5.5(a) shows how a soft
multiplier built from LUTs is affected by signal transition rates; figure 5.5(b) shows
how an embedded multiplier is affected.
5.3.2 Bitstream Reconfiguration
An alternative to multiplexing is bitstream reconfiguration [55, 123] in which the
entire chip, or part of it [113], is reconfigured at run time. With a smaller design
on the chip, the power consumed will be lower, assuming that parts of the device
can be deactivated when not in use. The disadvantage is that there are large energy
overheads associated with the reconfiguration process.
Bondalapati and Prasanna [17] use reconfiguration to increase the performance of
a circuit by allowing the width of each variable (or a subset of variables) to change
at run time (section 2.3.1). If the chip cannot be reconfigured or the overhead
is too high, another approach using multiplexers and demultiplexers [36, 93, 122]
can be used. This method supports fast reconfiguration but requires a large area
because the chip must be constructed or configured in such a way as to include
every component; components cannot be added using this approach alone. Bitstream
reconfiguration allows new components to be installed, giving the illusion that the
chip is larger than it actually is because components can be moved on and off the
chip at run time. This assumes that only a subset of the components will be used at
any given time [129]. Mignolet et al. [99] have developed a video decoder that allows
tasks to be executed on a processor or converted to a hardware circuit. To achieve
this, a software library [125] has been created which allows tasks to be described in
a high-level, architecture-independent way and then transformed at a later stage.
The ability to switch between different components in a flexible, efficient manner is
becoming more important as hardware circuits grow more complex [9]. Improving
area utilisation also allows power efficient hardware to be created by reducing the
amount of logic on-chip at any given time [89]. To accomplish this, the system is
allowed to adapt to changing conditions. In such situations it is important to reduce
the size of any controller that is required.
One of the problems with run-time reconfiguration is the complexity of the tools
needed for development. McMillan et al. [96] have extended the Xilinx JBits API to
provide an abstract way of reconfiguring a device. This means that the approach
may be more widely adopted. In some cases, only part of the system needs to change.
In order to allow the components of a system to be replaced while the system is in
use, partial run time reconfiguration can be used. Methods of reducing the time
to create these configurations and simplifying the development process have been
proposed [66, 74]. As well as increasing the use of reconfiguration, a simplified
development process will reduce the cost of a system.
5.3.3 Comparing Reconfiguration Strategies
Given that there is a trade-off between reconfiguration time — long when reconfiguring
the entire bitstream, short when only multiplexing components — and energy
required between reconfigurations — low when reconfiguring the design but high
when multiplexing because the entire design resides on-chip — the average run
time before one strategy becomes more efficient than the other must be quantified.
Figure 5.6 shows how the average run time given by equation (5.1) varies with output
precision. It indicates how long it takes before the overhead of reconfiguring the
design by multiplexing components becomes more costly, in terms of energy, than
reconfiguring the bitstream. The model is applied with the following assumptions:
14ms [53] is used for the average reconfiguration time and 1500mW [5] for the average
reconfiguration power. The overheads of using multiplexers are taken from the
designs in section 5.4. Although this may be an underestimate of reconfiguration
time if the entire chip is reconfigured, the graphs still show that leaving the entire
design on-chip can be more efficient. The same estimate is used for two chips, Xilinx
5.3 Reconfiguration Strategy  109
 0
 0.1
 0.2
 0.3
 0.4
 0.5
 0.6
 0.7
 5  10  15  20  25  30
T i
m
e  
[ s ]
Precision [bits]
(a) inner product
 0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
 5  10  15  20  25  30
T i
m
e  
[ s ]
Precision [bits]
(b) vector multiplication
Figure 5.6: Average run time above which multiple bitstream reconfiguration becomes
more efficient than multiplexing for a pipelined 6-element inner product and pipelined
8-element vector multiplier on a Xilinx Virtex II Pro XC2VP30 FPGA. In general, as the
size of a multiplier increases, the difference in area compared with a constant multiplier of
the same size increases (section 5.4.1, figure 5.7). The same is true of power consumption
(section 5.4.1, figure 5.8). When using bitstream reconfiguration, constant multipliers can
be employed; if the constants change, the chip can be reconfigured. Constant multipliers
are not used when multiplexing; since any constant can be input and bitstream reconfiguration
is not being employed, multipliers must be used. As the size of the operators
increases (indicated by the Precision), the difference in power increases. It will therefore
become more efficient to reconfigure the chip. Modifying the bitstream becomes more
efficient as the run time between reconfigurations increases; the longer the power loss is
sustained for, the more inefficient the design employing multiplexing (which makes use of
multipliers as opposed to constant multipliers) becomes.
XC2VP30 and XC3S500E. For more accurate estimates, the reconfiguration power
consumption overhead should be measured for each specific device.
The greatest disadvantage of reducing the word-length (chapter 4) at run time
by reducing the toggle rate of the least significant bits is the decrease in efficiency as
the size of the reduction increases. This is because the clock connected to the unused
bits is still toggling. In conjunction with word-length optimisation, figures 5.6(a)
and 5.6(b) show how multiplexing constants compares to the overhead of bitstream
reconfiguration (section 5.1.2). In these cases, either a set of constant multipliers is
used in the multiple bitstream approach or standard multipliers with multiplexed
constants. As the accuracy requirement increases, the constant multipliers become
more efficient, favouring the multiple bitstream approach, despite its reconfiguration
overhead.
If the power overhead (figure 5.3) and time between reconfigurations are known,
a constraint on the maximum allowable reconfiguration energy can be created.
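As a worked illustration of this constraint, rearranging equation (5.1) bounds the reconfiguration energy by the energy lost to multiplexing over one interval. The 14 ms and 1500 mW figures are the estimates used in this section; the symbol $E_b$, the power loss of 200 mW and the interval of 0.1 s are assumptions introduced only for the arithmetic.

```latex
\begin{align*}
E_b = t_b \times p_b &< \delta p \times t_e
  && \text{(bitstream reconfiguration is worthwhile)} \\
t_b \times p_b &= 14\,\text{ms} \times 1500\,\text{mW} = 21\,\text{mJ} \\
\delta p \times t_e &= 200\,\text{mW} \times 0.1\,\text{s} = 20\,\text{mJ}
\end{align*}
```

With these assumed values the allowance (20 mJ) is below the estimated reconfiguration energy (21 mJ), so keeping the design on-chip and multiplexing would remain the more efficient choice.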
[Figure 5.7 plots. (a) inner product and (b) vector multiplication: Area [LUTs] against Precision [bits]; series: multiplexed, bitstream reconfig.]
Figure 5.7: Area against word-length for a pipelined constant 6-element inner product and
pipelined constant 8-element vector multiplier using multiplexed constant inputs compared
with those using a single constant input on the Xilinx XC2VP30.
5.4 Results
All designs are synthesised using Handel–C 5 and Xilinx ISE 10.1. The power
consumption is measured by attaching an ammeter to the 1.5V VCCINT jumpers
(on the XC2VP30) and the 1.2V VCCINT jumpers (on the XC3S500E) that supply
power to the FPGA. All designs run at 100MHz in order to compare their power
consumption.
5.4.1 Inner Product and Vector Multiplication
Figures 5.7 and 5.8 show how area and power vary with target precision for designs
using multiplexed constant coefficients (see section 5.1.2) and designs using multiple
bitstreams, optimised by constant propagation. The difference in area and power
between the design dedicated to a single set of constants and one capable of using
multiple sets of constants grows as the word-length increases. The larger this gap,
the more favourable bitstream reconfiguration is because the power loss caused by a
larger than required design is so high that the reconfiguration overhead becomes less
significant. Figure 5.6 shows how long it takes for the energy loss to become greater
than the energy required to reconfigure the chip.
5.4.2 Uniform Cubic B–Splines
B–splines equations are applied in image deformation applications [71] and can be
accelerated with dedicated hardware, such as FPGAs. Figure 5.3 shows the power
5.4 Results  111
 200
 400
 600
 800
 1000
 1200
 1400
 1600
 1800
 2000
 5  10  15  20  25  30
P o
w
e r
 [ m
W
]
Precision [bits]
multiplexed
bitstream reconfig
(a) XC2VP30 inner product
 0
 500
 1000
 1500
 2000
 2500
 5  10  15  20  25  30
P o
w
e r
 [ m
W
]
Precision [bits]
multiplexed
bitstream reconfig
(b) XC2VP30 vector multiplication
 50
 100
 150
 200
 250
 300
 350
 5  10  15  20
P o
w
e r
 [ m
W
]
Precision [bits]
multiplexed
bitstream reconfig
(c) XC3S500E inner product
 0
 50
 100
 150
 200
 250
 300
 5  10  15  20
P o
w
e r
 [ m
W
]
Precision [bits]
multiplexed
bitstream reconfig
(d) XC3S500E vector multiplication
Figure 5.8: Power consumption against word-length for a constant inner product and
constant vector multiplier using multiplexed constant inputs compared with those
using a single constant input.
consumption and area of the B–splines circuit. Section 5.2.1 describes the associated
power overheads that come with fast on-chip reconfiguration and the implications of
choosing a reconfiguration strategy. The B–splines design follows a similar pattern to
that of the ray tracer described in section 5.4.3. Although there is a large power loss
caused by keeping the entire design on-chip, there is also a large saving by combining
word-length optimisation with the approach of modelling reconfiguration.
5.4.3 Ray Tracing
The bottleneck in most ray tracers is the ray-object intersection. For every ray,
it must be determined whether it will intersect with an object or be unaffected.
The less precise the arithmetic operators, the lower the image quality produced.
If the circuit were created to cater for less accurate image creation, the precision
could be greatly reduced to conserve power. Since the ray tracer should be able to
[Figure 5.9 plots. (a) area: Area [LUTs/Flip-Flops] against Precision [bits]; series: flip-flops, luts. (b) power consumption: Power [mW] against Precision [bits]; series: multiplexed-max, multiplexed-reduced, bitstream reconfig.]
Figure 5.9: Area and power consumption of the ray tracer with varying output accuracy
on the Xilinx XC2VP30. As with the B–splines benchmark, the power loss caused by
multiplexing is large; however, the power improvement is also large. If an ASIC were
used, the power loss from the clock tree could be eliminated, although the routing overhead
would increase (also affecting power) compared with a circuit not employing gated clocks.
deal with any scene, the word-lengths are reduced at run time to minimise power
consumption. Figure 5.9(a) shows the area of the ray tracer at different output
precisions (section 4.3). Figure 5.9(b) shows the dynamic power reduction of the ray
tracer at different output precisions for the multiplexer-based approach and multiple
bitstream approach.
Although the power loss is large, when the reconfiguration interval is lower than
0.1 seconds, keeping every component of the design on-chip will reduce energy. Since
this is a large circuit, the estimate for reconfiguration time may be too small. The
larger the reconfiguration time, the more beneficial multiplexing becomes. Power
reductions of 3.7% per bit and 2.7% per bit are shown for bitstream reconfiguration
and multiplexing respectively — a large power reduction, highlighting the need for
run-time reconfigurable circuits.
Combining word-length optimisation with multiplexing can result in a larger
energy improvement if dedicated resources are utilised because they are more efficient
than soft logic [84]. The power loss will therefore be smaller, increasing the time
before bitstream reconfiguration will become more efficient (figures 5.6 and 5.5).
5.4.4 Analysis
Figure 5.6 (section 5.3.3) shows the average execution time required before bitstream
reconfiguration becomes more efficient than keeping the entire circuit on-chip and
multiplexing components — when the reconfiguration interval increases above the
5.4 Results  113
specified value, bitstream reconfiguration is more efficient. If the reconfiguration
interval continuously changes, a controller is required to determine when to reconfigure.
There are several possibilities [17]:
• Do not reconfigure for the duration.
• Calculate a reconfiguration schedule to reduce energy.
• Determine when to reconfigure at run time.
Combining different reconfiguration strategies can reduce reconfiguration time and
energy. Multiplexing offers fast reconfiguration, although the throughput may be
reduced because the circuit will be larger. Bitstream reconfiguration results in a
smaller circuit but it will not produce results while reconfiguring.
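The third possibility, deciding at run time, can be sketched as a small controller routine. In the C sketch below the enumeration, the function names and the use of an exponential moving average to track the reconfiguration interval are all assumptions for illustration; the text does not prescribe a particular controller.

```c
#include <assert.h>

enum strategy { USE_MULTIPLEXING, USE_BITSTREAM_RECONFIG };

/* Bitstream reconfiguration is chosen once the interval rises above
 * the break-even value; otherwise the design stays on-chip. */
enum strategy choose_strategy(double interval_s, double break_even_s)
{
    assert(interval_s >= 0.0 && break_even_s >= 0.0);
    return (interval_s > break_even_s) ? USE_BITSTREAM_RECONFIG
                                       : USE_MULTIPLEXING;
}

/* Track the average interval with an exponential moving average
 * (an assumed policy); `weight` is the share given to the newest value. */
double update_average_interval(double average_s, double observed_s,
                               double weight)
{
    assert(weight > 0.0 && weight <= 1.0);
    return (1.0 - weight) * average_s + weight * observed_s;
}
```

A controller of this kind adds logic and power of its own, which the comparison between strategies would need to account for.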
Dividing an application into a set of phases can increase performance and reduce
area [119]. Branch probabilities have been exploited to reduce circuit area. If a
branch is executed infrequently, it is possible to reduce the amount of logic. Area
improvements of up to 27.5% are shown [118] for progressive refinement radiosity.
As the scenario — in this case, characterised by a set of branch frequencies —
changes, the area improvement can change. In other words, as the branch frequencies
change, such as the number of loop iterations, the accuracy is likely to change.
The model described in this chapter thus improves energy consumption over static¹
approaches [60, 87] because the circuit can be reconfigured rapidly based on iteration
counters. It is shown that reducing the accuracy of a B–splines circuit by 25%
results in a power reduction of 31% on a Xilinx Virtex II Pro FPGA (section 5.2.1,
figure 5.3). If multiplexers are selected to perform the same task there is a 20%
reduction in power. If the reconfiguration interval is smaller than 0.097 seconds,
multiplexing becomes more efficient than work employing bitstream reconfiguration
alone. Dedicated hardware, such as embedded multipliers, can be used to increase
this time (figure 5.5(b)).
The greatest drawback of multiplexing is the power loss caused by not switching
the clock tree off when it is not in use. An FPGA clock tree has a limited amount
of flexibility; only large regions can be switched off. If an ASIC were developed,
this would not occur to the same degree because the clock tree in an ASIC can be
[Footnote 1: A circuit that does not adapt to run-time conditions is referred to as static.]
deactivated at any point; although once created, it cannot be modified. Despite this,
energy will still be reduced in an FPGA provided that the time between successive
reconfigurations is small. The approaches outlined above [60, 87] do not make use of
multiplexing and may therefore incur a large reconfiguration penalty (both time and
power).
5.5 Proposed Model of Computation
Based on the large power reduction gained by reconfiguring part of a circuit using
word-length optimisation at run time, a new model of computation is proposed to
make use of accuracy reduction in software applications. It has been shown [101]
that processor power can be reduced by decreasing the switching activity of the most
significant bits; the proposed approach targets the least significant bits. Several
factors must be addressed before word-length analysis can be applied to processors:
• The granularity of word-length reduction. This concerns the number of instructions
  that will be affected by a reduction in word-length.
• The size and number of levels of reduction; this will affect the amount of
  additional logic required to support such a system.
It is proposed that additional instructions be added to a custom processor to reduce
the width of the floating-point arithmetic at run time. Each thread will have to store
the level of accuracy such that when a context switch takes place, the accuracy can
be modified. This approach could be particularly useful in mobile devices because
they could reduce accuracy when running on battery power. If custom processors
were used, they could be reconfigured to support an arbitrary word-length without
the need for additional logic. Figure 5.10 shows two circuits to control the switching
activity of a function. Configuration (a) is able to switch between full precision
and reduced precision based on a control input. Configuration (b) is able to switch
between three precision widths but requires additional logic.
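The behaviour of the two configurations can be modelled in software as masks applied to the input word. The C sketch below is a behavioural model only: the 32-bit width, the active-high polarity of the control inputs, the group sizes and all function names are assumptions, since the text specifies only the number of control inputs and precision levels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Clear the `count` least significant bits of `b`. */
static uint32_t clear_low(uint32_t b, unsigned count)
{
    assert(count < 32);
    return b & ~((UINT32_C(1) << count) - 1);
}

/* Configuration (a): one control input, full or reduced precision. */
uint32_t gate_input_a(uint32_t b, bool c, unsigned reduced_by)
{
    return c ? clear_low(b, reduced_by) : b;
}

/* Configuration (b): two control inputs, each zeroing its own group of
 * low-order bits, giving full precision and two reduced widths. */
uint32_t gate_input_b(uint32_t b, bool c0, bool c1, unsigned g0, unsigned g1)
{
    uint32_t above_g0 = clear_low(UINT32_MAX, g0);         /* bits >= g0 */
    uint32_t above_both = clear_low(UINT32_MAX, g0 + g1);  /* bits >= g0+g1 */
    uint32_t group0 = ~above_g0;                /* bits [0, g0) */
    uint32_t group1 = above_g0 & ~above_both;   /* bits [g0, g0+g1) */
    if (c0) b &= ~group0;
    if (c1) b &= ~group1;
    return b;
}
```

With two groups of four bits, the input 0xFFFF passes unchanged with both controls low, becomes 0xFFF0 with only c0 set and 0xFF00 with both set.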
Part of an operator could be switched off as opposed to having its switching
activity reduced at the input. As discussed in section 5.3.1, this may not be possible
in an FPGA. To approximate this, the control input could be routed to the clock
enable on the flip-flops used to calculate the least significant bits. This would
5.5 Proposed Model of Computation  115
c
b0
0
b1
0
b2
...
bn-1
(a)
c0
b0
0
b1
0
b3
...
bn-1
b2
0
...
c1
(b)
Figure 5.10: Two circuits to reduce switching activity. Multiplexers are placed between
the input and the function to stop the input toggling. Configuration (a) has one control
input enabling it to switch between full precision and reduced precision. Configuration
(b) is capable of switching between three different precision widths; the controller will use
more power (and the routing overhead will increase) but the variable can have its precision
more precisely controlled. If an ASIC were used, the clock tree could also be switched off
to reduce power.
increase the amount of routing hardware and may reduce the clock frequency; power
consumption may also increase when the circuit is operating at full precision. This
approach can be realised in an ASIC by gating the clock. The ideal situation would
be a reconfigurable device with a fine-grain clock gating mechanism. The clock tree
would have to be capable of being switched off at different points, arranged in such a
way that the least significant bits of an operator could be gated. A compromise that
may be employed is to use dedicated multipliers with this capability, thus combining
the advantages of an ASIC (little or no power loss) with those of an FPGA (flexible
reconfiguration). Although the number of bits that could be switched off would be
fixed, the hardware area would be small. To approximate this, embedded multipliers
could be connected together (figure 5.5(b)). An example of how this might be used
is shown in figure 5.11 which introduces an additional library function to reduce
precision.
    float a;
    unsigned int iterations, i;
    ...

    if (iterations < 1000)
        reduce_precision();

    while (i < iterations)
    {
        a = a * 2;
        ...
        i++;
    }
Figure 5.11: Example source code showing how to reduce energy. The behaviour of loops
and conditional statements commonly affects the precision of variables. In this case, as
the number of iterations increases, the error on a increases. A decrease in the number
of iterations may mean that the same output error can be obtained at a lower precision,
reducing power.
Combining the approach with bitstream reconfiguration reduces the number of
phases handled by the device. For example, the approach outlined in this chapter
switches between two precision widths. This means that a simple controller is used
to reduce power. Allowing the controller to switch between an arbitrary number of
phases increases the amount of logic and routing becomes more complex, reducing
the power saved. If bitstream reconfiguration were used, the controller could handle a
large number of phases with a small amount of logic, provided that only a small subset
of the total number of phases were active over a short period of time. The model
outlined can determine how short this length of time must be before reconfiguring the
entire chip becomes impractical. Given that conditional statements commonly cause
a word-length to change, the model can be extended to reduce power consumption
of the controller. Realising the controller as a circuit will be more efficient than
a set of instructions that will require decoding (which could be recoded to reduce
energy [42]). The source code to use this model is shown in figure 5.12. Methods to
reduce precision take as input a variable to be monitored. A comparator is realised
as a hardware circuit.
A disadvantage of this approach is the difficulty of controlling which variables
have their precision reduced. It may be the case that only a subset of variables
should have their precision reduced to maintain accuracy. Changing the precision
of multiple variables rapidly will cause a high rate of toggling. For this reason, it
may be beneficial to calculate uniform precision widths to reduce the power required
by the controller. As shown in chapter 4, a uniform precision often overestimates
precision and power consumption [1].
    reduce_precision(iterations);        reduce_precision(c);

    while (i < iterations)               if (c)
    {                                    {
        a = a * 2;                           a = a * 2;
        ...                                  ...
    }                                    }
Figure 5.12: Example source code to further reduce energy by modifying the circuit to
incorporate the controller which determines when and how to reduce precision. The
precision is reduced based on a trigger. This can be applied to loops (left) and conditional
statements (right).
5.6 Summary
In many applications, as the scenario changes, the requirements of the system change,
potentially leaving parts underutilised [69]. This chapter presents an approach to
developing run-time reconfigurable hardware. Three aspects that must be considered
when designing a system capable of adapting based on input stimuli are as follows.
1. The location of the components that need to change. Word-length optimisation
is used in this case (described in chapter 4; extending [17]).
2. The reconfiguration strategy. Two approaches are used: multiplexer-based
reconfiguration [36] and bitstream reconfiguration [66].
3. The frequency of reconfiguration. A model has been created to determine
which reconfiguration strategy to use based on the frequency of reconfiguration
and size of the component being reconfigured.
A model has been developed to determine which reconfiguration strategy produces
the most energy-efficient circuit given the reconfiguration frequency required. It has
been shown that word-length optimisation can be used to reduce the power consumed
by hardware circuits and combined with the reconfiguration strategy, determined by
the model, to reduce the energy used by the system.
Current and future work includes extending the proposed approach to cover recon-
figurable designs that make use of partial bitstream reconfiguration and determining
whether the design time will be significantly increased as a result. Investigation into
the types of controller required to determine which reconfiguration strategy to use
and their power consumption is being undertaken; phase information will be analysed
to accomplish this [119]. Changing the number of stages in a pipeline causes the
power consumption to change [128]. This effect will be analysed to find the most
efficient way of constructing the pipeline to increase the power saved by switching
off bits.
CHAPTER 6
Summary and Conclusions
This chapter concludes the work and discusses extensions currently being undertaken.
There are three key challenges when solving computationally demanding problems:
increasing performance, reducing energy and reducing development time. In this
thesis an approach is presented for developing area- and energy-efficient hardware
circuits rapidly, based on the accuracy required.
To ensure that the accuracy of a software application is guaranteed, a set of
linear equations is constructed to represent the problem and then solved to find the
optimal representation of each node in the data-flow graph. It is shown that mixing
different numerical representations can reduce the area of a hardware circuit. In
some cases the error on the output must be guaranteed. Word-length analysis is used
to reduce the power consumed by circuits and combined with a model developed
to determine the most energy-efficient design. Two methods of reconfiguring a
circuit — multiplexer-based reconfiguration and multiple bitstream reconfiguration —
are combined. A model selects which method to use based on the reconfiguration
frequency.
This thesis has three contributions:
• An integer linear programming (ILP) formulation designed to select the optimal
data representation of each operator in a circuit based on architectural
constraints (chapter 3).
• Compile-time word-length optimisation to reduce the area and power using
scalable, aggressive heuristics (chapter 4).
• Energy reduction based on systematic hardware deactivation (chapter 5).
Limitations of each of the approaches presented are also discussed. In most cases
the research can be applied to arbitrary applications; however, certain groups of
applications are unlikely to benefit.
6.1 Combining Multiple Data Representations
Several problems exist when selecting the optimal¹ numerical representation. These
have been addressed in chapter 3. To summarise them:
1. Using a single representation for every node in the data-flow graph can be
inefficient.
2. Sub-optimal solutions to problems are often generated because it takes too long
to find the optimal solution [87]. It is unclear how effective different algorithms
will be with regard to how close the solutions are to optimal and how long
they will take to complete.
3. Creating hardware designs using multiple number systems is time-consuming
given that each arithmetic unit could be a different size and numerous con-
straints may need to be compiled into either an ILP formulation, simulated
annealing algorithm or heuristic.
(1) The first contribution is an approach to optimising the representation and width
of arithmetic operators in a software application to enable it to be efficiently realised
as a hardware circuit (section 3.2). Combining different representations gives a
reduction in area of up to 15%. The starting point is a C++ design using single and
double precision floating-point and integer types; the output is a hardware circuit
with the optimal selection of number systems for arithmetic operators, given device
constraints such as the number and type of dedicated resources available. In order
to select the representation, for example, fixed-point, the width is calculated — the
number of bits must be high enough to match the accuracy of the floating-point
unit. This extends [62, 131] in which the accuracy of a floating-point algorithm is
not guaranteed but reduced to decrease the area. A common approach is to combine
floating-point multipliers and fixed-point adders. Hardware to convert from one
format to another is included to ensure that an operator in one format does not use
operands of a different format. The benefit of analysing each operator individually is
that error does not have to be propagated. This means that the area can be reduced
even though some operators are unrecognised. If enough dedicated resources exist
on the device, a reduction in area of 11% can be achieved. The improvement over a
fixed-point design is much higher and may increase further if the required accuracy
were to increase.
¹ Optimal with regard to area, power, etc.
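The trade-off between per-operator area and conversion hardware can be illustrated with a small sketch. It is not the ILP formulation of chapter 3: it exhaustively searches a simple chain of operators, and the `Op` structure, the area values and the single `convert_area` cost are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

enum Rep { Fixed = 0, Float = 1 };

struct Op {
    double area[2];  // hypothetical area of this operator in each representation
};

// Exhaustively assign a representation to each operator in a chain,
// charging `convert_area` whenever adjacent operators differ.
// Returns the minimum total area and writes the assignment to `best`.
inline double select_representations(const std::vector<Op>& chain,
                                     double convert_area,
                                     std::vector<Rep>& best) {
    assert(chain.size() < 20);  // exhaustive search only suits tiny designs
    const std::size_t n = chain.size();
    double best_area = std::numeric_limits<double>::infinity();
    for (std::size_t mask = 0; mask < (std::size_t{1} << n); ++mask) {
        double area = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const int rep = static_cast<int>((mask >> i) & 1u);
            area += chain[i].area[rep];
            if (i > 0 && rep != static_cast<int>((mask >> (i - 1)) & 1u))
                area += convert_area;  // format converter between operators
        }
        if (area < best_area) {
            best_area = area;
            best.assign(n, Fixed);
            for (std::size_t i = 0; i < n; ++i)
                if ((mask >> i) & 1u) best[i] = Float;
        }
    }
    return best_area;
}
```

With an adder, multiplier, adder chain, a cheap converter makes the mixed assignment win, whereas a costly converter pushes the search back to a single representation.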
(2) It is important to produce optimal results, but it is also important to obtain
results quickly. An integer linear programming (ILP) formulation of the approach
is used to generate optimal solutions under resource constraints. ILP works well
on small designs but it is not scalable. For this reason the results are compared
to those generated with simulated annealing, an algorithm that is not guaranteed
to produce optimal solutions (section 3.3). The algorithm selects valid solutions
quickly, obtaining a solution in less than a second. In this thesis it is shown that the
solutions are near-optimal (optimal in most cases).
With advances in reconfigurable chip design, larger applications are being de-
veloped on FPGAs containing hundreds of operators. To facilitate rapid design space
exploration, circuits are generated automatically from the software description using
ROSE [107]. This allows the correct cores to be generated; each operator may
require a different core (with regard to size, representation and dedicated resources
required). An architectural description of the device allows the design flow to
be customised. The description specifies the number of resources available and which
operations can be carried out on these resources. A demonstration of the approach is
shown on 8 benchmarks: a ray tracer, a B–splines design, the GARCH(1,1) financial
model, convolution, polynomial approximation, complex multiplication, Gaussian
blur and fast Fourier transform (section 3.4). A 15% improvement in area can be
gained if no DSP48 blocks² are used and up to 11% if they are used. Additionally,
the number of DSP48 blocks can be reduced by up to 15%. Larger improvements
can be obtained if dedicated resources are limited or operators cannot be mapped to
dedicated resources, for example, floating-point addition (up to 27%).
The approach has three potential limitations:
• There will be a small area reduction on integer applications because additional
representations, such as floating-point, will have a large overhead. There is no
improvement over a fixed-point Gaussian blur; however, the area of a
floating-point design with integer inputs and floating-point constants is reduced by
22%.
• In general, an adder with a fixed-point representation is smaller than one
with a floating-point representation and a multiplier with a floating-point
representation is smaller than a multiplier with a fixed-point representation
(table 2.1). If adders and multipliers are interleaved, clusters of the same type
of operator are difficult to find. This means that if each operator were to be
realised in the representation that produced the most area-efficient circuit, a
large amount of hardware would be introduced to convert between the formats.
This would negate the area reduction gained by mixing different numerical
representations (polynomial approximation exhibits this, having a maximum
improvement of less than 3%).
• It is not possible to reduce the area by a specified amount; the reduction cannot
be controlled because the accuracy is fixed. The approach can be combined with
a word-length optimisation technique to achieve this. In general, a resource
that is costly in one representation will be costly in another; an exception is the
logarithmic number system. In this number system multiplication is generally
less costly than addition. It is therefore not clear which technique should be
applied first. In practice, the logarithmic number system tends to provide a
small improvement, if any, unless the inputs are already in this format.
² Dedicated resources on a Xilinx device, containing a multiplier and adder.
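The remark that multiplication is generally less costly than addition in the logarithmic number system can be seen from a minimal software model. The sketch below assumes positive values only and uses double-precision logarithms in place of a fixed-point format; the type and function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// A positive value x is stored as log2(x); sign and zero handling are omitted.
struct Lns {
    double log2_value;
};

inline Lns to_lns(double x) {
    assert(x > 0.0);
    return Lns{std::log2(x)};
}

inline double from_lns(Lns v) { return std::exp2(v.log2_value); }

// Multiplication is a single addition of the stored logarithms.
inline Lns lns_multiply(Lns a, Lns b) {
    return Lns{a.log2_value + b.log2_value};
}

// Addition needs the non-linear term log2(1 + 2^(lo - hi)), which is
// considerably more expensive to evaluate than an addition.
inline Lns lns_add(Lns a, Lns b) {
    const double hi = std::fmax(a.log2_value, b.log2_value);
    const double lo = std::fmin(a.log2_value, b.log2_value);
    return Lns{hi + std::log2(1.0 + std::exp2(lo - hi))};
}
```

The asymmetry between `lns_multiply` and `lns_add` is the reverse of the fixed-point case, which is why the ordering of representation selection and word-length optimisation is unclear for this number system.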
Despite these restrictions, the approach can still be used to select the optimal
representation, whether that be a mixed representation or not. If only fixed-point
and floating-point were adopted, it could be more beneficial to simply optimise the
floating-point units based on the range and precision required; barrel shifters could
be removed or at least reduced in size.
6.2 Scalable Word-Length Optimisation
The second contribution extends the first by guaranteeing the accuracy of the output.
This is a stronger constraint than guaranteeing the accuracy of a floating-point
algorithm, which may not always be sufficient. To guarantee accuracy, constraints
must be added, for example, the number of bits of accuracy on each output and
the range of the inputs. The range of internal variables can be calculated by
propagating the ranges of each input to the output (or back-propagating the range
of the outputs [116]), by performing an operation on ranges as opposed to individual
variables [100]. This has been extended [117] to accurately calculate ranges that are
correlated. Tighter bounds can be obtained by simulating the application; however,
this may be time-consuming. Precision analysis and optimisation (calculating the
width of the fractional part of a variable) is more complex. Consider y = a × b
where a has a 16-bit range and b has a 10-bit range. It is not immediately clear
what the optimal precision of a and b should be, given that the calculation must
guarantee that the error on y is lower than a specified value. It has been shown to
be an NP–Hard problem [28].
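To make the y = a × b example concrete, the sketch below evaluates a worst-case error bound for a given choice of fractional widths. It assumes unsigned operands, a truncation error of at most 2^-p for p fractional bits and an exact multiplier; these assumptions and the function name are illustrative and are not taken from the analysis in chapter 4.

```cpp
#include <cassert>
#include <cmath>

// Worst-case absolute error on y = a * b when a and b are quantised.
// Assumptions: unsigned operands, truncation error of at most 2^-frac_bits
// per operand, and an exact multiplier.
inline double product_error_bound(int a_range_bits, int a_frac_bits,
                                  int b_range_bits, int b_frac_bits) {
    assert(a_frac_bits >= 0 && b_frac_bits >= 0);
    const double a_max = std::ldexp(1.0, a_range_bits);  // |a| < 2^range
    const double b_max = std::ldexp(1.0, b_range_bits);  // |b| < 2^range
    const double ea = std::ldexp(1.0, -a_frac_bits);     // error on a
    const double eb = std::ldexp(1.0, -b_frac_bits);     // error on b
    // |(a + da)(b + db) - ab| <= |a||db| + |b||da| + |da||db|
    return a_max * eb + b_max * ea + ea * eb;
}
```

Because the error on b is scaled by the 16-bit range of a, adding fractional bits to b reduces the bound faster than adding them to a. Many different width pairs meet the same bound at different cost, which is what makes the selection non-trivial.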
As explained in section 6.1, rapid, scalable analyses are becoming more important
as the size of software programs and hardware circuits increases. Previous approaches
to solving this problem, although shown to be near-optimal for small designs, have
proved to be too slow [87] (179 seconds and 32 seconds to analyse an 8×8 DCT and
a B–splines design respectively). With a larger application, the time will increase.
To reduce the run time while maintaining a near-optimal solution, three additional
techniques have been proposed (chapter 4):
1. Aggressive heuristics to estimate non-uniform word-lengths rapidly while meet-
ing error constraints (section 4.1.1). These heuristics allow much larger ap-
plications to be analysed, enabling energy to be reduced for large software
applications as well as hardware circuits (chapter 5).
2. A method of reducing the complexity of the problem by partitioning the data-
flow graph (section 4.1.2), increasing the performance of previous approaches [29,
108].
3. The use of information gathered at run time to calculate the precision required
for functions using an unknown algorithm, and the use of control-flow analysis
to reduce power consumption (section 4.2), extending static approaches [87].
(1) Aggressive heuristics determine non-uniform word-lengths rapidly while meeting
error constraints (section 4.1.1). The word-lengths are rapidly reduced based on
their error and cost to improve results.
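A greedy reduction driven by error and cost can be sketched as follows. This is not the heuristic of section 4.1.1: the additive error model, the `gain` and `cost_per_bit` fields and the selection rule are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Signal {
    int frac_bits;        // current fractional word-length
    double gain;          // hypothetical sensitivity of the output to this signal
    double cost_per_bit;  // hypothetical area cost of one bit
};

// Output error model (an assumption): sum of gain * 2^-frac_bits.
inline double output_error(const std::vector<Signal>& s) {
    double e = 0.0;
    for (const Signal& x : s) e += x.gain * std::ldexp(1.0, -x.frac_bits);
    return e;
}

// Greedily remove one bit at a time from the signal with the best
// cost-saved-to-error-added ratio, stopping when no removal keeps the
// output error within `max_error`.
inline void reduce_word_lengths(std::vector<Signal>& s, double max_error) {
    assert(max_error >= 0.0);
    for (;;) {
        const double base = output_error(s);
        std::size_t pick = s.size();
        double best_ratio = 0.0;
        for (std::size_t i = 0; i < s.size(); ++i) {
            if (s[i].frac_bits == 0) continue;
            // Removing one bit doubles this signal's error term.
            const double added = s[i].gain * std::ldexp(1.0, -s[i].frac_bits);
            if (base + added > max_error) continue;
            const double ratio = s[i].cost_per_bit / added;
            if (pick == s.size() || ratio > best_ratio) {
                best_ratio = ratio;
                pick = i;
            }
        }
        if (pick == s.size()) return;  // no further reduction is feasible
        --s[pick].frac_bits;
    }
}
```

Each step costs one pass over the signals, so the run time grows with the number of bits removed rather than with the size of the search space.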
The approach has been tested on 7 applications: ray tracing, B–splines, Gaussian
blur, floating-point convolution, matrix-vector multiplication, polynomial approxim-
ation and RGB to YCbCr colour conversion. The heuristic is over 50 times faster
for a B–splines design with less than 2% increase in area compared with simulated
annealing, shown to be within 1% of the optimal solution for small designs [87].
For a large convolution, the heuristics can produce better results than simulated
annealing in some cases. Although results generated with simulated annealing may
be improved, the algorithm would have to run for significantly longer. The longer
the algorithm runs, the smaller the improvement is likely to be.
(2) Partitioning the data-flow graph has been shown to reduce the time taken to run
the algorithm without significantly compromising the solution. For a convolution
benchmark, the area is within 2% and the algorithm run time has improved by 25
times.
(3) In this chapter, word-lengths are reduced while guaranteeing the error on the
output, tightening the constraint used in chapter 3 in which the error of a floating-
point algorithm was guaranteed. This means that error must be propagated from
input to output. The disadvantage is that every function in the algorithm must
be known. To tackle this, automatic differentiation is employed to calculate the
word-length of unknown library functions. This is demonstrated by optimising the
area of a ray tracer that contains a square root function. Without this technique the
width of the core would be unknown; the maximum word-length would have to be
used.
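The role of the derivative can be illustrated with forward-mode automatic differentiation on a square root. The sketch below uses a first-order error estimate and hypothetical names; the procedure used in section 4.2 may differ.

```cpp
#include <cassert>
#include <cmath>

// Forward-mode dual number: a value and its derivative with respect to the input.
struct Dual {
    double value;
    double deriv;
};

// Square root lifted to dual numbers: d/dx sqrt(x) = 1 / (2 sqrt(x)).
inline Dual dual_sqrt(Dual x) {
    assert(x.value > 0.0);
    const double r = std::sqrt(x.value);
    return Dual{r, x.deriv / (2.0 * r)};
}

// First-order estimate of the output error caused by an input error `dx`.
inline double propagated_error(double x, double dx) {
    const Dual y = dual_sqrt(Dual{x, 1.0});
    return std::fabs(y.deriv) * dx;
}
```

At x = 4 the derivative is 0.25, so an input error of 2^-10 appears on the output as roughly 2^-12; under this model the core's input can lose two bits of precision relative to its output requirement.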
Although this approach can be applied to any application to reduce area and
power consumption, it has two potential drawbacks.
• The output error can become unrepresentative of the real error if the application
contains loops that accumulate a large error.
• In order to reduce energy, functions in a loop may need to be optimised first
despite the high error produced.
Despite the output error constraint being unrepresentative of the real error, operators
that cause a large error are identified. Some strategies, for example, minimising
energy, may require the width of an operator that is frequently used to be reduced
first even though a large error will be produced.
Error may have little correlation with cost. In this case a cost-based algorithm [34]
may give better results but will take longer to complete.
6.3 Systematic Run-Time Hardware Deactivation
The third contribution combines two methods — multiplexer-based reconfiguration
and bitstream reconfiguration — with word-length optimisation to develop energy-
efficient run-time reconfigurable hardware circuits (section 5.1). The starting point is
a software application containing conditional statements and loops. If no conditional
branches exist, other triggers must be used to determine when reconfiguration should
occur. Compile-time word-length optimisation is adopted in conjunction with this
technique to reduce circuit area. Three questions are being tackled (chapter 5):
• Which parts of a hardware design should be reconfigured at any given time?
• When should a design be reconfigured?
• How should a design be reconfigured?
To answer these questions, a model has been developed to estimate the energy
required using each approach such that the optimal reconfiguration strategy can be
selected.
Location. Word-length analysis shows where to deactivate the unwanted bits.
Although phase analysis [69, 119] can be used to determine the overall accuracy
required by the system, word-length analysis is needed to show which bits can be
removed to guarantee that this accuracy is maintained. Word-length analysis can be
split into two components: range analysis to reduce the integer part of a variable
and precision analysis to reduce the fractional part. When a variable is not using its
full range, power will be reduced. Consider an unsigned integer variable capable of
storing 10 bits, currently storing a 5 bit value over several clock cycles. The most
significant bits will be zero, which means that the bits are essentially switched off,
although the clock will still be toggling. Now consider the same variable with a 10
bit fractional part (precision). If only 5 bits were required to meet the accuracy
constraint, the remaining 5 bits would not be used but could still contain toggling
values. Additional power is therefore consumed. Previous approaches to dynamic
word-length modification targeted integer [18, 101] and finite-precision [17] variables.
The proposed methodology applies to infinite-precision variables by calculating the
worst-case error. Locating the bits that can be switched off is crucial and is discussed
in detail in chapter 4. In this thesis it is shown that word-length optimisation enables
power reductions of up to 2.7% per bit of output accuracy for a ray tracer.
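The effect of unused fractional bits on switching activity can be approximated by counting bit transitions. The sketch below is a rough proxy rather than a power model; the 16-bit sample type and the function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bit transitions between consecutive samples of a signal,
// a simple proxy for switching activity (an assumption, not a power model).
inline unsigned count_toggles(const std::vector<std::uint16_t>& samples) {
    unsigned toggles = 0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        std::uint16_t diff =
            static_cast<std::uint16_t>(samples[i] ^ samples[i - 1]);
        for (; diff != 0; diff &= diff - 1) ++toggles;  // clear lowest set bit
    }
    return toggles;
}

// Force the `unused_bits` least significant bits of every sample to zero.
inline std::vector<std::uint16_t> zero_low_bits(
        std::vector<std::uint16_t> samples, unsigned unused_bits) {
    assert(unused_bits < 16);
    const std::uint16_t keep =
        static_cast<std::uint16_t>(0xFFFFu << unused_bits);
    for (std::uint16_t& s : samples) s &= keep;
    return samples;
}
```

For a 10-bit value that alternates between all ones and all zeros, zeroing the five least significant bits halves the number of transitions counted.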
Time. The conditions under which multiplexer-based reconfiguration should be
used in preference to multiple bitstream reconfiguration are derived (section 5.2).
The analysis is based on the accuracy requirements. Multiplexing between different
configurations is fast (clock cycles) but less efficient since unused hardware remains
active on-chip and may not be able to be gated. Bitstream reconfiguration³ is much
slower (milliseconds) and consumes more power [5]; however, no unwanted hardware
for that given phase remains on-chip, possibly reducing the power consumption
between reconfigurations.
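The trade-off described above can be expressed as a simple break-even calculation. This is a sketch under stated assumptions, not the model derived in section 5.2: each strategy is reduced to a one-off switching energy and a constant power while a phase runs, and all names and parameters are hypothetical.

```cpp
#include <cassert>

// All parameters are hypothetical inputs in consistent units (W, s, J).
struct Strategy {
    double switch_energy;  // energy needed to change configuration
    double active_power;   // power drawn while the phase runs
};

// Energy used for one phase lasting `duration` seconds.
inline double phase_energy(const Strategy& s, double duration) {
    assert(duration >= 0.0);
    return s.switch_energy + s.active_power * duration;
}

// True if bitstream reconfiguration uses less energy for this phase.
inline bool prefer_bitstream(const Strategy& mux, const Strategy& bitstream,
                             double duration) {
    return phase_energy(bitstream, duration) < phase_energy(mux, duration);
}

// Phase duration at which both strategies use the same energy. Only
// meaningful when the bitstream configuration draws less power.
inline double break_even_duration(const Strategy& mux,
                                  const Strategy& bitstream) {
    assert(mux.active_power > bitstream.active_power);
    return (bitstream.switch_energy - mux.switch_energy) /
           (mux.active_power - bitstream.active_power);
}
```

Phases longer than the break-even duration favour bitstream reconfiguration; shorter phases favour multiplexing.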
Strategy. The two reconfiguration methods are compared in section 5.3. Using clock
gating to stop the bits that are not required from switching is not always possible,
so other methods are used to achieve the same effect. Due to the area saving gained
by multiplexing individual bits with zero to reduce bit-switching, this approach is
adopted. This approach, although beneficial on FPGAs, can also be applied to
application-specific integrated circuits (ASICs). ASICs have two advantages in this
case. First, they consume less power [133] and second, they can be built to switch
off components regardless of the granularity required. If applied to a processor the
power consumption of the ALU could be reduced if low-accuracy computation were
required. Multiplexing enables the power of a multiplier to be reduced by up to 1.1%
per bit.
The greatest disadvantage of run-time word-length optimisation as a method of
power reduction in an FPGA is the power loss primarily due to the architecture of
the clock tree. Bitstream reconfiguration can be used to alleviate this problem in
some cases. Based on the results presented, a new device structure is proposed in
section 5.5 in which the clock tree is modified, enabling part of it to be switched off
on a more fine-grain level.
6.4 Future Work
A number of questions still need to be answered, discussed in the following sections.
³ Bitstream reconfiguration is often referred to as run-time reconfiguration; however,
since this term also applies to multiplexer-based reconfiguration, it is avoided.
6.4.1 Combining Multiple Data Representations
Although the area of a circuit is reduced and the clock frequency is increased by
mixing different numerical representations, it is not clear how much power will be
saved. Power consumption in FPGAs and ASICs can be characterised as static or
dynamic. Since the architecture of the FPGA is not being altered, the focus is on
dynamic power, estimated as follows [80]:
\[ P = \sum_{r \in \text{resources}} C_r V_r^2 f_r \]
where $C_r$, $V_r$ and $f_r$ are the capacitance, voltage, and operating frequency of resource
r, respectively. Dynamic power is the component of power related to switching (signal
transitions). Fixed-point adders are more power-efficient than floating-point adders;
however, there will be a smaller gap between floating-point multipliers and
fixed-point multipliers. Any converters may negate the power saving created by using
fixed-point adders, resulting in a very similar power consumption. The question
therefore becomes, does this approach save power as well as area? It is not clear
whether a separate cost model for power consumption will produce circuits that
consume less power (which is required when a compile-time, accuracy-guaranteed
approach [87] is selected [1]).
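The dynamic power estimate can be evaluated directly once per-resource values are available. The sketch below assumes SI units and a hypothetical `Resource` structure; obtaining realistic capacitance and switching frequency figures is outside its scope.

```cpp
#include <cassert>
#include <vector>

struct Resource {
    double capacitance;  // C_r, farads
    double voltage;      // V_r, volts
    double frequency;    // f_r, hertz
};

// P = sum over resources of C_r * V_r^2 * f_r, in watts.
inline double dynamic_power(const std::vector<Resource>& resources) {
    double total = 0.0;
    for (const Resource& r : resources) {
        assert(r.capacitance >= 0.0 && r.frequency >= 0.0);
        total += r.capacitance * r.voltage * r.voltage * r.frequency;
    }
    return total;
}
```

Comparing this sum for a mixed-representation design, including its converters, against a single-representation design is one way to approach the question raised above.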
The approach has also not been tested on a wide range of embedded blocks because
the resources have not been available. Ho et al. [63] show how to evaluate embedded
functional blocks in FPGAs. This method can be adopted to look at how new
embedded blocks may affect results. Coupled with this, a wider variety of numerical
representations will be tested, particularly the logarithmic number system [49, 121],
to create efficient hardware devices when large clusters of multipliers exist.
6.4.2 Scalable Word-Length Optimisation
Phase analysis may help reduce the conservative nature of compile-time word-length
optimisation. The idea would be to perform a separate word-length analysis for each
phase of execution. This highlights the need for a fast algorithm.
When simulating an application, it is important that the results are as realistic
as possible, for example, when generating output images [50]. It will therefore be
desirable to use bit-accurate simulations. When a linear constraint solver is chosen
over a heuristic algorithm, an accurate cost model is required. Inaccurate cost models
are problematic because the area reported by the mapping tools will differ from
the estimated cost; ILP will therefore be of little use. Since simulated annealing is
shown to produce results within 1% of the optimal [87], methods of creating more
representative cost functions should be used. A method of adding new cost models
will have to be included to cater for newly developed arithmetic units. This could be
solved by extending the architectural description to provide cost models specific to
each device. Several parameters will have to be considered.
• The numerical format must have a description.
• Each format must have a set of operand widths defined, for example, range or
exponent.
• The number of stages in the pipeline must be given along with how this affects
the area.
• Additional run-time information may be required, for example, the derivative
of the unit (section 4.2.1).
Given these parameters, a cost model can be created for a new operator. Coupled
with this, the resource binding problem must be addressed [31].
Resource scheduling has been used to optimise the number of pipeline stages [37].
This is becoming more important because the size and complexity of hardware
circuits are increasing.
6.4.3 Systematic Run-Time Hardware Deactivation
The current approach assumes that the entire bitstream is reconfigured. It may be
possible to produce a more efficient design if only part of it were reconfigured at
any given time. The proposed approach will be extended to cover reconfigurable
designs that make use of partial bitstream reconfiguration; the effect this has on
development time will be assessed. To reduce the overhead of reconfiguration
further, each bitstream configuration will be capable of performing multiplexer-based
reconfiguration. For example, when the device is reconfigured it may be able to
multiplex between an output word-length of x bits and y bits. When reconfigured
again, x and y could change. This technique may be combined with more efficient
controllers making use of dedicated devices.
Different reconfiguration strategies and their impact on power and energy con-
sumption are being investigated in conjunction with different types of controller;
phase information [119] is being analysed to accomplish this study. The reconfig-
uration controller is important when deactivating parts of the circuit to conserve
power. One proposed strategy is for the FPGA to be self-reconfiguring [14]. Tools
have been developed to enable feedback from the environment to cause the chip to
reconfigure [43]. This would eliminate user intervention, increasing performance.
Coupled with this, it is not clear how much energy could be saved if power
models [1] were used to calculate the word-lengths. Given that the system alternates
between a maximum accuracy and a set of configurations with a lower accuracy, it may
be desirable to use an area-based analysis first (the high-accuracy configuration), as
in the current approach, and a power-based analysis for the remaining configurations.
Investigation into the power saving obtainable by using ASICs as opposed to
FPGAs will help determine whether this approach will be beneficial in embedded
processors. Soft core processors on FPGAs (for example, Xilinx MicroBlaze, Xilinx
PicoBlaze and Altera Nios II) can be modified to include a customised ALU. The
multiplier and adder in the ALU will be modified to enable bit deactivation. To
determine which bits should be deactivated, new instructions will be required. Two
questions will need to be answered:
• What should the granularity of selection be?
• Should every bit be able to be switched off?
The question of how many instructions should be affected by a change in word-length
will require a complex analysis. The greater the number of instructions affected, the
more power may be wasted because a partition of instructions is not as efficient
as it could be, given that every instruction may require a slightly different accuracy.
The second question requires knowledge of the amount of power that can be saved by
disabling each bit in a processor. If disabling each bit does not significantly reduce
power it may be more beneficial to disable two bits at a time.
This method can be used to estimate the power saving achieved by realising the
approach in a general purpose processor. Coupled with this, the effect of the pipeline
structure on energy reduction per operation [128] when combined with word-length
optimisation is being explored.
APPENDIX A
Reducing Circuit Area using Multiple
Data Representations
A.1.1 Ray Tracer Architectural Description
    representation fixed, float;
    operator multiply, add, sqrt;
    resource dsp48s;

    // Valid architectural choices.
    architecture fixed multiply -> dsp48s;
    architecture float multiply -> dsp48s;

    // Invalid architectural choices.
    disallowed architecture fixed sqrt;

    // Architectural constraint.
    limit dsp48s = 96;
Figure A.1: This description shows the information required to describe the architecture.
It is assumed that every core can be realised in lookup tables, multiplexers and registers;
hence, if this is not the case it must be explicitly stated. Disallowing architectural choices
can decrease the time taken to run the algorithm, although this is not done in the results
shown; for the benchmarks discussed, a fixed-point square root is allowed.
A.1.2 Cost Metrics
The fixed-point adder cost is defined as follows:
max(Range1, Range2) + min(Precision1, Precision2)
where Range1, Range2, Precision1 and Precision2 are the fixed-point widths. The
fixed-point multiplier cost is defined as follows:
(Range1 + Precision1) × (Range2 + Precision2)
Algorithms such as Karatsuba multiplication may be employed to reduce the cost. Two
prediction methods are required to calculate the area of multipliers: one assuming
DSP blocks are used (if Mult18×18s are used, the number of LUTs depends on the
number of DSP blocks) and another without. Constant coefficient multipliers are
modelled in a similar way but multiplied by an empirically determined fraction to
reduce the size.
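The fixed-point cost metrics above can be sketched in a few lines of C++. This is only an illustration of the formulas as stated, not the tool's implementation: the names FixedFormat, fixedAdderCost, fixedMultiplierCost and constMultiplierCost are hypothetical, and the scaling fraction for constant-coefficient multipliers is left as a caller-supplied parameter because the text does not give its value.

```cpp
#include <algorithm>

// Hypothetical container for a fixed-point format: the number of range
// (integer) bits and precision (fractional) bits.
struct FixedFormat {
    int range;
    int precision;
};

// Adder cost as stated in the text:
// max(Range1, Range2) + min(Precision1, Precision2).
inline int fixedAdderCost(const FixedFormat& a, const FixedFormat& b) {
    return std::max(a.range, b.range) + std::min(a.precision, b.precision);
}

// Multiplier cost as stated in the text:
// (Range1 + Precision1) x (Range2 + Precision2).
inline int fixedMultiplierCost(const FixedFormat& a, const FixedFormat& b) {
    return (a.range + a.precision) * (b.range + b.precision);
}

// Constant-coefficient multiplier: the general multiplier cost scaled by an
// empirically determined fraction. The fraction is an assumption supplied by
// the caller; no value is given in the text.
inline double constMultiplierCost(const FixedFormat& a, const FixedFormat& b,
                                  double fraction) {
    return fraction * fixedMultiplierCost(a, b);
}
```

For operands with 8 range and 4 precision bits and 6 range and 10 precision bits, the sketch gives an adder cost of 12 and a multiplier cost of 192. The DSP-block prediction method is not modelled here.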
The floating-point adder cost can be estimated by summing the following components:
Compare and select (exponent): Exponent.
Compare and select (mantissa): Mantissa.
Mantissa alignment: Mantissa × log2(Mantissa).
Addition: Mantissa.
Find leading one and shift: Mantissa + (Mantissa × log2(Mantissa)) + Exponent.
Rounding.
Additional components will have to be added depending on the specific architecture
and number of pipeline stages. The leading-one detector will be larger if
higher-performance algorithms are used.
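As a rough illustration, the component list for the floating-point adder can be summed as follows. The function name floatAdderCost is hypothetical, and the rounding term is passed in by the caller (defaulting to zero) because the text lists rounding without giving an expression for it. Architecture-specific and pipeline-dependent components are not modelled.

```cpp
#include <cmath>

// Estimated floating-point adder cost: the sum of the components listed in
// the text. exponentBits and mantissaBits are the field widths (mantissaBits
// must be positive). roundingCost is a caller-supplied assumption.
inline double floatAdderCost(int exponentBits, int mantissaBits,
                             double roundingCost = 0.0) {
    const double e = exponentBits;
    const double m = mantissaBits;
    const double shift = m * std::log2(m);   // Mantissa x log2(Mantissa) term

    const double compareExponent = e;        // compare and select (exponent)
    const double compareMantissa = m;        // compare and select (mantissa)
    const double alignment = shift;          // mantissa alignment
    const double addition = m;               // mantissa addition
    const double normalise = m + shift + e;  // find leading one and shift

    return compareExponent + compareMantissa + alignment + addition +
           normalise + roundingCost;
}
```

With an 8-bit exponent and a 16-bit mantissa the components are 8, 16, 64, 16 and 88, giving an estimate of 192 before rounding is added.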
The floating-point multiplier cost can be estimated by summing the following
components:
Exponent addition: Exponent.
Multiplication: Mantissa × Mantissa.
Rounding.
A quadratic function is used to estimate the size of the fixed-point and floating-point
square root because it is not clear which algorithm has been used. If there are only
a few floating-point formats, constants can be added to the estimate to ensure that
any device-specific optimisations are taken into account.
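The multiplier and square-root estimates can be sketched in the same way. The names floatMultiplierCost and sqrtCost are hypothetical. The quadratic coefficients a, b and c are placeholders for fitted values that the text does not give, and the rounding cost is again a caller-supplied assumption.

```cpp
// Estimated floating-point multiplier cost: exponent addition plus the
// mantissa product, plus a caller-supplied rounding term (an assumption,
// since the text gives no expression for rounding).
inline double floatMultiplierCost(int exponentBits, int mantissaBits,
                                  double roundingCost = 0.0) {
    const double m = mantissaBits;
    return exponentBits + m * m + roundingCost;
}

// Quadratic area model for a square root of the given width:
// a * w^2 + b * w + c. The coefficients are hypothetical and would have to
// be fitted to measured core sizes; a device-specific constant can be
// folded into c.
inline double sqrtCost(int widthBits, double a, double b, double c) {
    const double w = widthBits;
    return a * w * w + b * w + c;
}
```

For an 8-bit exponent and a 16-bit mantissa the multiplier estimate is 8 + 256 = 264 before rounding.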
APPENDIX B
Scalable Accuracy-Guaranteed
Word-Length Optimisation
B.1.1 Source Code Annotations
Figure B.1 shows source code for the ray tracer demonstrating the use of #pragma
statements to provide information about variable range, precision and black-box
functions. The #pragma annotations are parsed by ROSE [107] and used by different
stages of the analysis.
int Sphere::intersect(Ray *ray, ...)
{
    Point a;
    float b, dist, diff1, diff2, ...;

    #pragma r1: min_range = 0; r1: max_range = ...
    r1: error = 0;
    #pragma diffEps: output_precision = 16;
    float diffEps = r1 * 0.125;

    // Vector subtraction.
    a[0] = pos[0] - ray->start[0];
    a[1] = pos[1] - ray->start[1];
    a[2] = pos[2] - ray->start[2];

    // Dot product (a . ray->dir).
    b0 = a[0] * ray->dir[0];
    b1 = a[1] * ray->dir[1];
    b2 = a[2] * ray->dir[2];
    b = b0 + b1 + b2;

    // Dot product (a . a).
    a0sqr = a[0] * a[0];
    ...
    adot = a0sqr + a1sqr + a2sqr;

    #pragma dist: output_precision = 16; dist: ...
    dist = r2 - adot + (b * b);

    if (dist > 0.0)
    {
        #pragma sqrt: dydx = 1000; ...
        sqrt_dist = sqrt(dist);

        #pragma diff2: output_precision = 16;
        diff2 = b + sqrt_dist;

        if (diff2 > diffEps)
        {
            #pragma diff1: output_precision = 16;
            diff1 = b - sqrt_dist;

            if (diff1 > diffEps)
            ...
Figure B.1: Source code annotations for a ray-sphere intersection. Annotations are used to
specify the minimum range, maximum range and error of each input variable. The output
precision determines the accuracy required. Black-box functions use dydx to specify the
sensitivity of a variable to error. Cost functions may also be specified as annotations.
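One possible reading of the output_precision and dydx annotations is sketched below. It rests on assumptions that the text does not confirm: that output_precision = p denotes an absolute error bound of 2^-p, and that dydx acts as a first-order bound, so that the output error of a black-box function is at most |dydx| times its input error. All function names are hypothetical.

```cpp
#include <cmath>

// Assumed reading of "output_precision = p": absolute error at most 2^-p.
inline double errorBudget(int outputPrecisionBits) {
    return std::ldexp(1.0, -outputPrecisionBits);
}

// Assumed first-order bound through a black-box function with sensitivity
// dydx: output error <= |dydx| * input error.
inline double blackBoxOutputError(double dydx, double inputError) {
    return std::fabs(dydx) * inputError;
}

// Largest input error that still meets a given output error bound under the
// same first-order assumption (dydx must be non-zero).
inline double maxInputError(double dydx, double outputError) {
    return outputError / std::fabs(dydx);
}

// Number of fractional bits needed so that 2^-bits <= maxError.
inline int fractionalBitsFor(double maxError) {
    return static_cast<int>(std::ceil(-std::log2(maxError)));
}
```

Under these assumptions, a black-box function annotated with dydx = 1000 that must deliver 16 bits of output precision tolerates an input error of at most 2^-16 / 1000, which corresponds to 26 fractional bits on its input.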
 135
 2000
 3000
 4000
 5000
 6000
 7000
 8000
 9000
 5  10  15  20  25  30
A r
e a
 [ L
U T
s ]
Precision [bits]
uniform
rand_heuristic
err_heuristic
heuristic
sa
(a) area
 0.01
 0.1
 1
 10
 100
 1000
 10000
 5  10  15  20  25  30
T i
m
e  
[ s ]
Precision [bits]
sa
heuristic
err_heuristic
rand_heuristic
(b) algorithm run time
Figure B.2: Area and algorithm run time for the Gaussian blur benchmark at varying
levels of precision. The heuristic algorithm is over 7 times faster than
simulated annealing.
[Figure B.3 plots. (a) Area: area [LUTs] against precision [bits] for uniform,
heuristic and sa. (b) Algorithm run time: time [s] on a logarithmic axis against
precision [bits] for sa and heuristic.]
Figure B.3: Area and algorithm run time for the Gaussian blur benchmark at varying levels
of precision using a partitioned data-flow graph. Partitioning the data-flow graph reduces
the time taken to run the algorithm: it is over 12 times faster than simulated annealing,
with an area increase of less than 2%.
B.1.2 Heuristic Optimisation
Figures B.2 and B.4 show the effect of heuristic optimisation on a Gaussian blur and
RGB to YCbCr colour conversion respectively. Figure B.3 shows that partitioning
the data-flow graph reduces the algorithm run time for a small increase in area.
Given that the algorithm may be run several times on much larger applications, this
is an essential step.
[Figure B.4 plots. (a) Area: area [LUTs] against precision [bits] for uniform,
rand_heuristic, err_heuristic, heuristic and sa. (b) Algorithm run time: time [s] on a
logarithmic axis against precision [bits] for sa, heuristic, err_heuristic and
rand_heuristic.]
Figure B.4: Area and algorithm run time for the RGB to YCbCr conversion benchmark
at varying levels of precision. The heuristic algorithm runs over 30 times faster with an
area within 1% of simulated annealing.
BIBLIOGRAPHY
[1] Altaf Abdul Gaffar, Jonathan A. Clarke, and George A. Constantinides. Powerbit - power
aware arithmetic bit-width optimization. In Proceedings of the International Conference on
Field-Programmable Technology, pages 289–292, December 2006.
[2] Altaf Abdul Gaffar, Oskar Mencer, Wayne Luk, Peter Y.K. Cheung, and Nabeel Shirazi.
Floating-point bitwidth analysis via automatic differentiation. In Proceedings of the IEEE
International Conference on Field-Programmable Technology, pages 158–165, December 2002.
[3] Jonathan Bachrach, Dany Qumsiyeh, and Mark Tobenkin. Hardware scripting in Gel. In
Proceedings of the 16th Annual IEEE Symposium on Field-Programmable Custom Computing
Machines, pages 13–22. IEEE Computer Society, April 2008.
[4] Michael J. Beauchamp, Scott Hauck, Keith D. Underwood, and K. Scott Hemmert.
Architectural modifications to enhance the floating-point performance of FPGAs. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 16(2):177–187, February 2008.
[5] Jürgen Becker, Michael Hübner, and Michael Ullmann. Power estimation and power
measurement of Xilinx Virtex FPGAs: Trade-offs and limitations. In Proceedings of the 16th Annual
Symposium on Integrated Circuits and Systems Design, pages 283–288. IEEE Computer
Society, September 2003.
[6] Tobias Becker, Peter Jamieson, Wayne Luk, Peter Y.K. Cheung, and Tero Rissa. Power
characterisation for the fabric in fine-grain reconfigurable architectures. In Proceedings of the
5th Southern Conference on Programmable Logic, pages 77–82. IEEE, April 2009.
[7] Tobias Becker, Wayne Luk, and Peter Y. K. Cheung. Enhancing relocatability of partial
bitstreams for run-time reconfiguration. In Proceedings of the 15th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines, pages 35–44. IEEE Computer Society,
April 2007.
[8] Pavle Belanović and Markus Rupp. Automated floating-point to fixed-point conversion with
the fixify environment. In Proceedings of the 16th IEEE International Workshop on Rapid
System Prototyping, pages 172–178. IEEE Computer Society, June 2005.
[9] Luca Benini and Giovanni De Micheli. Networks on chip: A new paradigm for systems on
chip design. In Proceedings of Design, Automation and Test in Europe, pages 418–419, March
2002.
[10] Luca Benini, Polly Siegel, and Giovanni De Micheli. Automatic synthesis of gated clocks for
power reduction in sequential circuits. IEEE Design and Test of Computers, 11(4):32–40,
1994.
[11] Richard Vincent Bennett, Alastair Colin Murray, Björn Franke, and Nigel Topham. Combining
source-to-source transformations and processor instruction set extension for the automated
design-space exploration of embedded systems. ACM SIGPLAN Notices, 42(7):83–92, July
2007.
[12] Giuseppe Bernacchia and Marios C. Papaefthymiou. Analytical macromodeling for high-level
power estimation. In Proceedings of the International Conference on Computer Aided Design,
pages 280–283. IEEE Computer Society, November 1999.
[13] Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: Hardware design in
Haskell. In Proceedings of the 3rd ACM SIGPLAN International Conference on Functional
Programming, pages 174–184. ACM Press, 1998.
[14] Brandon Blodget, Philip James-Roxby, Eric Keller, Scott McMillan, and Prasanna
Sundararajan. A self-reconfiguring platform. In Field-Programmable Logic and Applications, volume
2778 of Lecture Notes in Computer Science, pages 565–574. Springer, September 2003.
[15] S. Bobba, I. N. Hajj, and N. R. Shanbhag. Analytical expressions for power dissipation of
macro-blocks in DSP architectures. In Proceedings of the International Conference on VLSI
Design, pages 358–365, January 1999.
[16] David Boland and George A. Constantinides. Automated precision analysis: A polynomial
algebraic approach. In Proceedings of the 18th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines. IEEE Computer Society, April 2010.
[17] Kiran Bondalapati and Viktor K. Prasanna. Dynamic precision management for loop
computations on reconfigurable architectures. In Proceedings of the 7th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines, pages 249–258. IEEE
Computer Society, April 1999.
[18] David Brooks and Margaret Martonosi. Value-based clock gating and operation packing:
Dynamic strategies for improving processor power and performance. ACM Transactions on
Computer Systems, 18(2):89–126, May 2000.
[19] Mihai Budiu, Majd Sakr, Kip Walker, and Seth Copen Goldstein. BitValue inference:
Detecting and exploiting narrow bitwidth computations. In 6th International Euro-Par
Conference, volume 1900 of Lecture Notes in Computer Science, pages 969–979. Springer,
September 2000.
[20] J. Adam Butts and Gurindar S. Sohi. A static power model for architects. In Proceedings of
the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 191–201,
December 2000.
 139
[21] Oswaldo Cadenas and Graham Megson. Power performance with gated clocks of a pipelined
Cordic core. In Proceedings of the International Conference on ASIC, volume 2, pages
1226–1230, October 2003.
[22] Yun Cao and Hiroto Yasuura. Leakage power reduction using bitwidth optimization. In
Proceedings of the 6th World Multiconference on Systemics, Cybernetics and Informatics,
pages 36–41, 2002.
[23] Mark L. Chang and Scott Hauck. Précis: A design-time precision analysis tool. In Proceedings
of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines,
pages 229–238. IEEE Computer Society, April 2002.
[24] Mark L. Chang and Scott Hauck. Least-significant bit optimization techniques for FPGAs.
In Proceedings of the ACM/SIGDA 12th International Symposium on Field-Programmable
Gate Arrays, pages 251–259, February 2004.
[25] Rui-Lin Chen and Chichyang Chen. Pipelined computation of very large word-length LNS
addition/subtraction computation with exponential convergence rate. In Proceedings of the
10th International Symposium on Pervasive Systems, pages 69–73. IEEE Computer Society,
2009.
[26] Jonathan A. Clarke, Altaf Abdul Gaffar, George A. Constantinides, and Peter Y. K. Cheung.
Fast word-level power models for synthesis of FPGA-based arithmetic. In Proceedings of the
IEEE Symposium on Circuits and Systems, pages 1299–1302, 2006.
[27] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde, and I. Bolsens. A methodology and design
environment for DSP ASIC fixed-point refinement. In Proceedings of Design, Automation
and Test in Europe, pages 271–277, March 1999.
[28] G. A. Constantinides and G. J. Woeginger. The complexity of multiple word-length assignment.
Applied Mathematics Letters, 15(2):137–140, February 2002.
[29] George A. Constantinides. Perturbation analysis for word-length optimization. In Proceedings
of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines,
pages 81–90. IEEE Computer Society, April 2003.
[30] George A. Constantinides. Word-length optimization for differentiable nonlinear systems.
ACM Transactions on Design Automation of Electronic Systems, 11(1):26–43, January 2006.
[31] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk. Multiple wordlength resource
binding. In Field-Programmable Logic and Applications, volume 1896 of Lecture Notes in
Computer Science, pages 646–655. Springer, August 2000.
[32] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk. Optimal datapath allocation
for multiple-wordlength systems. Electronics Letters, 35(17):1508–1509, August 2000.
[33] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk. Heuristic datapath allocation
for multiple wordlength systems. In Proceedings of Design, Automation and Test in Europe,
pages 791–796, March 2001.
[34] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk. The multiple word-length
paradigm. In Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines, pages 51–60. IEEE Computer Society, April 2001.
[35] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk. Synthesis of saturation
arithmetic architectures. ACM Transactions on Design Automation of Electronic Systems,
8(3):334–354, July 2003.
[36] Tim Courtney, Richard Turner, and Roger Woods. Mapping multi-mode circuits to
LUT-based FPGA using embedded MUXes. In Proceedings of the 10th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 318–319. IEEE Computer Society,
April 2002.
[37] José Gabriel F. Coutinho, Jun Jiang, and Wayne Luk. Interleaving behavioral and
cycle-accurate descriptions for reconfigurable hardware compilation. In Proceedings of the 13th
Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 245–254.
IEEE Computer Society, April 2005.
[38] José Gabriel F. Coutinho, David B. Thomas, and Wayne Luk. Architectural exploration of
reconfigurable monte-carlo simulations using a high-level synthesis approach. In Proceedings
of Automatic Program Generation for Embedded Systems, October 2007.
[39] Peter J. Denning. The working set model for program behavior. Communications of the
ACM, 11(5):323–333, May 1968.
[40] Ashutosh S. Dhodapkar and James E. Smith. Managing multi-configuration hardware via
dynamic working set analysis. In Proceedings of the 29th Annual International Symposium
on Computer Architecture, pages 233–244, August 2002.
[41] J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y. Savaria, and D. Poirier. A flexible floating
point format for optimizing data-paths and operators in FPGA based DSPs. In Proceedings
of the ACM/SIGDA 10th International Symposium on Field-Programmable Gate Arrays,
pages 50–55, February 2002.
[42] Robert G. Dimond, Oskar Mencer, and Wayne Luk. Combining instruction coding and
scheduling to optimize energy in system-on-FPGA. In Proceedings of the 14th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines, pages 175–184. IEEE
Computer Society, April 2006.
[43] P. C. Diniz and M. C. Rinard. Dynamic feedback: An effective technique for adaptive
computing. ACM SIGPLAN Notices, 32(5):71–84, May 1997.
 141
[44] Chun Te Ewe, Peter Y.K. Cheung, and George A. Constantinides. Dual fixed-point: An
efficient alternative to floating-point computation. In Field-Programmable Logic and
Applications, volume 3203 of Lecture Notes in Computer Science, pages 200–208. Springer,
August 2004.
[45] Claire Fang Fang, Rob A. Rutenbar, Markus Püschel, and Tsuhan Chen. Toward efficient
static analysis of finite-precision effects in DSP applications via affine arithmetic modeling.
In Proceedings of the 40th Annual Design Automation Conference, pages 496–501, June 2003.
[46] Fang Fang, Tsuhan Chen, and Rob A. Rutenbar. Floating-point bit-width optimization for
low-power signal processing applications. In Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, volume 3, pages 3208–3211, 2002.
[47] Amir H. Farrahi and Majid Sarrafzadeh. FPGA technology mapping for power minimization.
In Field-Programmable Logic Architectures, Synthesis and Applications, volume 849 of Lecture
Notes in Computer Science, pages 66–77. Springer, 1994.
[48] Haohuan Fu. Application Specific Number Representation. PhD thesis, Imperial College
London, October 2008.
[49] Haohuan Fu, Oskar Mencer, and Wayne Luk. Optimizing logarithmic arithmetic on FPGAs. In
Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing
Machines, pages 163–172. IEEE Computer Society, April 2007.
[50] Haohuan Fu, William Osborne, Robert G. Clapp, Oskar Mencer, and Wayne Luk. Accelerating
seismic computations using customized number representations on FPGAs. EURASIP Journal
on Embedded Systems, 2009(382983), 2009.
[51] Maya B. Gokhale, Janice M. Stone, Jeff Arnold, and Mirek Kalinowski. Stream-oriented
FPGA computing in the Streams-C high level language. In Proceedings of the 8th Annual
IEEE Symposium on Field-Programmable Custom Computing Machines, pages 49–56. IEEE
Computer Society, April 2000.
[52] Gokul Govindu, Ling Zhuo, Seonil Choi, Padma Gundala, and Viktor K Prasanna. Area, and
power performance analysis of a floating-point based application on FPGAs. In Proceedings
of the Seventh Annual Workshop on High Performance Embedded Computing, September
2003.
[53] Björn Griese, Erik Vonnahme, Mario Porrmann, and Ulrich Rückert. Hardware support
for dynamic reconfiguration in reconfigurable SoC architectures. In Field-Programmable
Logic and Applications, volume 3203 of Lecture Notes in Computer Science, pages 842–846.
Springer, August 2004.
[54] Andreas Griewank, David Juedes, and Jean Utke. Algorithm 755: ADOL-C: A package for
automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical
Software, 22(2):131–167, June 1996.
[55] S.A. Guccione and D. Levi. Design advantages of run-time reconfiguration. SPIE, 3844:87–92,
September 1999.
[56] Shaori Guo and Wayne Luk. Compiling Ruby into FPGAs. In Field-Programmable Logic and
Applications, volume 975 of Lecture Notes in Computer Science, pages 188–197. Springer,
August 1995.
[57] Subodh Gupta and Farid N. Najm. Power macromodeling for high level power estimation. In
Proceedings of the 34th Annual Design Automation Conference, pages 365–370, June 1997.
[58] Subodh Gupta and Farid N. Najm. Analytical model for high-level power modeling of
combinational and sequential circuits. In Proceedings of the Alessandro Volta Memorial
Workshop on Low Power Design, pages 164–172, March 1999.
[59] Samuel Z. Guyer and Calvin Lin. An annotation language for optimizing software libraries.
In Proceedings of the 2nd Conference on Domain Specific Languages, volume 2, pages 39–52,
October 1999.
[60] Malay Haldar, Anshuman Nayak, Alok Choudhary, Prith Banerjee, and Nagraj Shenoy. FPGA
hardware synthesis from MATLAB. In Proceedings of the 14th International Conference on
VLSI Design, pages 299–304, January 2001.
[61] Michael Haselman, Michael Beauchamp, Aaron Wood, Scott Hauck, Keith Underwood, and
K. Scott Hemmert. A comparison of floating point and logarithmic number systems for
FPGAs. In Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines, pages 181–190. IEEE Computer Society, April 2005.
[62] Reza Hashemian and Bipin Sreedharan. A hybrid number system and its application in
FPGA-DSP technology. In Proceedings of the International Conference on Information
Technology: Coding and Computing, volume 2, pages 342–346, April 2004.
[63] Chun Hok Ho, Philip H. W. Leong, Wayne Luk, Steven J. E. Wilton, and S. Lopez-Buedo.
Virtual embedded blocks: A methodology for evaluating embedded elements in FPGAs. In
Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing
Machines, pages 35–44. IEEE Computer Society, April 2006.
[64] Chun Hok Ho, Chi Wai Yu, Philip H. W. Leong, Wayne Luk, and Steven J. E. Wilton.
Domain-specific hybrid FPGA: Architecture and floating-point applications. In International
Conference on Field-Programmable Logic and Applications, pages 196–201, August 2007.
[65] C. A. R. Hoare. Communicating sequential processes. Communications of the ACM,
21(8):666–677, August 1978.
 143
[66] Edson L. Horta, John W. Lockwood, and David Parlour. Dynamic hardware plugins for
an FPGA with partial run-time reconfiguration. In Proceedings of the 39th Annual Design
Automation Conference, pages 343–348. ACM, June 2002.
[67] Brad Hutchings, Peter Bellows, Joseph Hawkings, and Scott Hemmert. A CAD suite for
high-performance FPGA design. In Proceedings of the 7th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 12–24. IEEE Computer Society,
April 1999.
[68] Akihiko Inoue, Hiroyuki Tomiyama, Eko Fajar Nurprasetyo, Hiroto Yasuura, and Hiroyuki
Kanbara. A programming language for processor based embedded systems. In Proceedings of
the Asia Pacific Conference on Hardware Description Languages, pages 89–94, July 1998.
[69] Canturk Isci and Margaret Martonosi. Phase characterization for power: Evaluating
control-flow-based and event-counter-based techniques. In Proceedings of the Twelfth International
Symposium on High-Performance Computer Architecture, pages 121–132, February 2006.
[70] Allan Jaenicke and Wayne Luk. Parameterised floating-point arithmetic on FPGAs. In
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing,
volume 2, pages 897–900, May 2001.
[71] Jun Jiang, Wayne Luk, and Daniel Rueckert. FPGA-based computation of free-form
deformations. In Field-Programmable Logic and Applications, volume 2778 of Lecture Notes in
Computer Science, pages 1057–1061. Springer, August 2003.
[72] T. Jiang, X. Tang, and P. Banerjee. Macro-models for high level area and power estimation
on FPGAs. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI, pages 162–165,
April 2004.
[73] Jürgen Becker, Michael Hübner, Gerhard Hettich, Rainer Constapel, Joachim Eisenmann,
and Jürgen Luka. Dynamic and partial FPGA exploitation. Proceedings of the IEEE,
95(2):438–452, February 2007.
[74] Heiko Kalte, Gareth Lee, Mario Porrmann, and Ulrich Rückert. REPLICA: A bitstream
manipulation filter for module relocation in partial reconfigurable systems. In Proceedings of
the 19th IEEE International Parallel and Distributed Processing Symposium, pages 151–158.
IEEE Computer Society, April 2005.
[75] Gershon Kedem. Automatic differentiation of computer programs. ACM Transactions on
Mathematical Software, 6(2):150–165, June 1980.
[76] Holger Keding, Martin Coors, Olaf Lüthje, and Heinrich Meyr. Fast bit-true simulation. In
Proceedings of the 38th Annual Design Automation Conference, pages 708–713, June 2001.
[77] Holger Keding, Markus Willems, Martin Coors, and Heinrich Meyr. FRIDGE: A fixed-point
design and simulation environment. In Proceedings of Design, Automation and Test in Europe,
pages 429–435, March 1998.
[78] Seehyun Kim, Ki-Il Kum, and Wonyong Sung. Fixed-point optimization utility for C and
C++ based digital signal processing programs. In Proceedings of the IEEE Workshop on VLSI
Signal Processing, pages 197–206, September 1995.
[79] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science,
New Series, 220(4598):671–680, May 1983.
[80] Matt Klein. Power considerations in 90nm FPGA designs. Xcell Journal, pages 56–59, Fourth
Quarter 2005.
[81] Israel Koren and Ofra Zinaty. Evaluating elementary functions in a numerical coprocessor
based on rational approximation. IEEE Transactions on Computers, 39(8):1030–1037, August
1990.
[82] Ki-Il Kum, Jiyang Kang, and Wonyong Sung. AUTOSCALER for C: An optimizing
floating-point to integer C program converter for fixed-point digital signal processors. IEEE
Transactions on Circuits and Systems II: Analog and digital signal processing, 47(9):840–848,
September 2000.
[83] Ki-Il Kum and Wonyong Sung. Combined word-length optimization and high-level synthesis of
digital signal processing systems. IEEE Transactions on Computer Aided Design of Integrated
Circuits and Systems, 20(8):921–930, August 2001.
[84] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–215,
February 2007.
[85] Julien Lamoureux, Guy Lemieux, and Steven J. E. Wilton. GlitchLess: Dynamic power
minimization in FPGAs through edge alignment and glitch filtering. IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, 16(11):1521–1534, November 2008.
[86] Julien Lamoureux and Steven J. E. Wilton. Activity estimation for field-programmable gate
arrays. In Proceedings of the International Conference on Field-Programmable Logic and
Applications, pages 1–8, August 2006.
[87] Dong-U. Lee, Altaf Abdul Gaffar, Ray C. C. Cheung, Oskar Mencer, Wayne Luk, and
George A. Constantinides. Accuracy-guaranteed bit-width optimization. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 25(10), October 2006.
[88] Dong-U Lee, Altaf Abdul Gaffar, Oskar Mencer, and Wayne Luk. Optimizing hardware
function evaluation. IEEE Transactions on Computers, 54(12):1520–1531, December 2005.
 145
[89] Jian Liang, Russell Tessier, and Dennis Goeckel. A dynamically-reconfigurable, power-efficient
turbo decoder. In Proceedings of the 12th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 91–100. IEEE Computer Society, April 2004.
[90] Jian Liang, Russell Tessier, and Oskar Mencer. Floating point unit generation and evolution
for FPGAs. In Proceedings of the 11th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 185–194. IEEE Computer Society, April 2003.
[91] Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock, and Thomas Panas. Extending
automatic parallelization to optimize high-level abstractions for multicore. In Evolving
OpenMP in an Age of Extreme Parallelism, volume 5568 of Lecture Notes in Computer
Science, pages 28–41. Springer, June 2009.
[92] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney,
Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized
program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN
Conference on Programming Language Design and Implementation, pages 190–200, June
2005.
[93] Wayne Luk, Nabeel Shirazi, and Peter Y. K. Cheung. Modelling and optimising
run-time reconfigurable systems. In Proceedings of the 4th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 167–176. IEEE Computer Society,
April 1996.
[94] Wayne Luk, Nabeel Shirazi, and Peter Y. K. Cheung. Compilation tools for run-time
reconfigurable designs. In Proceedings of the 5th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 56–65. IEEE Computer Society, April 1997.
[95] Arindam Mallik, Debjit Sinha, Prithviraj Banerjee, and Hai Zhou. Low power optimization by
smart bit-width allocation in a SystemC based ASIC design environment. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 26(3):447–455, March 2007.
[96] Scott McMillan and Steven A. Guccione. Partial run-time reconfiguration using JRTR. In
Field-Programmable Logic and Applications, volume 1896 of Lecture Notes in Computer
Science, pages 352–360. Springer, August 2000.
[97] Oskar Mencer. ASC: A Stream Compiler for computing with FPGAs. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 25(9):1603–1617, September
2006.
[98] Oskar Mencer, Martin Morf, and Michael J. Flynn. PAM-Blox: High performance FPGA
design for adaptive computing. In Proceedings of the 6th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 167–174. IEEE Computer Society,
April 1998.
[99] J. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Infrastructure
for design and management of relocatable tasks in heterogeneous reconfigurable
system-on-chip. In Proceedings of Design, Automation and Test in Europe, pages 986–991,
March 2003.
[100] R. Moore. Interval Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1966.
[101] Vasily G. Moshnyaga. Reducing switching activity of subtraction via variable truncation of
the most-significant bits. Journal of VLSI Signal Processing Systems, 33(1):75–82, January
2003.
[102] Anshuman Nayak, Malay Haldar, Alok N. Choudhary, and Prithviraj Banerjee. Precision and
error analysis of MATLAB applications during automated hardware synthesis for FPGAs. In
Proceedings of Design, Automation and Test in Europe, pages 722–728, March 2001.
[103] Jingzhao Ou and Viktor K. Prasanna. PyGen: A MATLAB/Simulink based tool for
synthesizing parameterized and energy efficient designs using FPGAs. In Proceedings of the 12th
Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 47–56.
IEEE Computer Society, April 2004.
[104] Emre Özer, Andy P. Nisbet, and David Gregg. A stochastic bitwidth estimation technique
for compact and low-power custom processors. ACM Transactions on Embedded Computer
Systems, 7(3):1–30, April 2008. Article 34.
[105] Katarina Paulsson, Michael Hübner, and Jürgen Becker. On-line optimization of FPGA
power-dissipation by exploiting run-time adaption of communication primitives. In Proceedings
of the 19th Annual Symposium on Integrated Circuits and Systems Design, pages 173–178,
September 2006.
[106] A Peleg and U Weiser. MMX technology extension to Intel architecture. IEEE Micro,
16(4):42–50, August 1996.
[107] Daniel J. Quinlan, Markus Schordan, Qing Yi, and Andreas Saebjornsen. Classification and
utilization of abstractions for optimization. In Leveraging Applications of Formal Methods,
volume 4313 of Lecture Notes in Computer Science, pages 57–73. Springer, 2006.
[108] Sanghamitra Roy and Prith Banerjee. An algorithm for trading off quantization error with
hardware resources for MATLAB-based FPGA design. IEEE Transactions on Computers,
54(7):886–896, July 2005.
[109] Radu Rugina and Martin Rinard. Pointer analysis for multithreaded programs. In Proceedings
of the ACM SIGPLAN Conference on Programming Language Design and Implementation,
pages 77–90, May 1999.
[110] Li Shang and Niraj K. Jha. High-level power modeling of CPLDs and FPGAs. In Proceedings
of the International Conference on Computer Design, pages 46–51, September 2001.
 147
[111] Changchun Shi and Robert W. Brodersen. Automated fixed-point data-type optimization
tool for signal processing and communication systems. In Proceedings of the 41st Annual
Design Automation Conference, pages 478–483. ACM, June 2004.
[112] Nabeel Shirazi, Wayne Luk, and Peter Y. K. Cheung. Automating production of
run-time reconfigurable designs. In Proceedings of the 6th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 147–156. IEEE Computer Society,
April 1998.
[113] Miguel L. Silva and João Canas Ferreira. Support for partial run-time reconfiguration of
platform FPGAs. Journal of Systems Architecture, 52(12):709–726, December 2006.
[114] Tajana Šimunić, Luca Benini, Giovanni De Micheli, and Mat Hans. Source code optimization
and profiling of energy consumption in embedded systems. In Proceedings of the 13th
International Symposium on System Synthesis, pages 193–198, 2000.
[115] Jennifer Stephenson. Design guidelines for optimal results in FPGAs. Altera, 2005.
http://www.altera.com/literature/cp/fpgas-optimal-results-396.pdf.
[116] Mark Stephenson, Jonathan Babb, and Saman Amarasinghe. Bitwidth analysis with application
to silicon compilation. In Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation, pages 108–120, June 2000.
[117] J. Stolfi and L. de Figueiredo. Self-Validated Numerical Methods and Applications. Institute
for Pure and Applied Mathematics, Rio de Janeiro, 1997.
[118] Henry Styles and Wayne Luk. Exploiting program branch probabilities in hardware compilation.
IEEE Transactions on Computers, 53(11):1408–1419, November 2004.
[119] Henry Styles and Wayne Luk. Compilation and management of phase-optimized reconfigurable
systems. In Proceedings of the International Conference on Field-Programmable Logic and
Applications, pages 311–316, August 2005.
[120] K. H. Tsoi, C. H. Ho, H. C. Yeung, and P. H. W. Leong. An arithmetic library and its
application to the N-body problem. In Proceedings of the 12th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 68–78. IEEE Computer Society,
April 2004.
[121] Kuen Hung Tsoi. Computer arithmetic synthesis technologies on reconfigurable platforms. In
Proceedings of the International Conference on Field-Programmable Logic and Applications,
pages 713–714, August 2005.
[122] Richard H. Turner and Roger F. Woods. Design flow for efficient FPGA reconfiguration.
In Field-Programmable Logic and Applications, volume 2778 of Lecture Notes in Computer
Science, pages 972–975. Springer, September 2003.
[123] Michael Ullmann, Michael Hübner, Björn Grimm, and Jürgen Becker. An FPGA run-time
system for dynamical on-demand reconfiguration. In Proceedings of the 18th Parallel and
Distributed Processing Symposium, pages 135–142. IEEE Computer Society, April 2004.
[124] Keith Underwood. FPGAs vs. CPUs: Trends in peak floating-point performance. In
Proceedings of the ACM/SIGDA 12th International Symposium on Field-Programmable Gate
Arrays, pages 171–180, February 2004.
[125] G. Vanmeerbeeck, P. Schaumont, S. Vernalde, M. Engels, and I. Bolsens. Hardware/Software
partitioning of embedded system in OCAPI-xl. In Proceedings of the 9th International
Symposium on Hardware/Software Codesign, pages 30–35, April 2001.
[126] Markus Willems, Volker Bürsgens, Holger Keding, Thorsten Grötker, and Heinrich Meyr.
System level fixed-point design based on an interpolative approach. In Proceedings of the
34th Annual Design Automation Conference, pages 293–298, June 1997.
[127] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe,
Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall,
Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing
and optimizing compilers. ACM SIGPLAN Notices, 29(12):31–37, December 1994.
[128] Steven J. E. Wilton, Su-Shin Ang, and Wayne Luk. The impact of pipelining on energy per
operation in field-programmable gate arrays. In Field-Programmable Logic and Applications,
volume 3203 of Lecture Notes in Computer Science, pages 719–728. Springer, August 2004.
[129] Michael J. Wirthlin and Brad L. Hutchings. Improving functional density using run-time
circuit reconfiguration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
6(2):247–256, June 1998.
[130] Francis G. Wolff, Michael J. Knieser, Dan J. Weyer, and Chris A. Papachristou. High-level
low power FPGA design methodology. In Proceedings of the IEEE National Aerospace and
Electronics Conference, pages 554–559, October 2000.
[131] Harold Wüst, Klaus Kasper, and Herbert Reininger. Hybrid number representation for the
FPGA-realization of a versatile neuro-processor. In Proceedings of the 24th EUROMICRO
Conference, volume 2, pages 694–701. IEEE Computer Society, August 1998.
[132] Saeyang Yang. Logic synthesis and optimization benchmarks user guide version 3.0. Technical
report, Microelectronics Center of North Carolina, January 1991.
[133] Yan Zhang, Jussi Roivainen, and Aarne Mämmelä. Clock-gating in FPGAs: A novel and
comparative evaluation. In Proceedings of the 9th EUROMICRO Conference on Digital
System Design: Architectures, Methods and Tools, pages 584–590, October 2006.
