High-level power optimisation for Digital Signal Processing in Reconfigurable Logic by Clarke, Jonathan A
High-level power optimisation for Digital
Signal Processing in Reconfigurable Logic
by
Jonathan A. Clarke
M.Eng(Hons)
A Thesis submitted in fulfilment of requirements for the degree of
Doctor of Philosophy of Imperial College London and
Diploma of Imperial College
Circuits & Systems Research Group
Department of Electrical and Electronic Engineering
Imperial College London
26th September 2008
Abstract
This thesis is concerned with the optimisation of Digital Signal Processing (DSP) algorithm
implementations on reconfigurable hardware via the selection of appropriate word-lengths
for the signals in these algorithms, in order to minimise system power consumption. Whilst
existing word-length optimisation work has concentrated on the minimisation of the area of
algorithm implementations, this work introduces the first set of power consumption models
that can be evaluated quickly enough to be used within the search of the enormous design
space of multiple word-length optimisation problems. These models achieve their speed by
estimating both the power consumed within the arithmetic components of an algorithm
and the power in the routing wires that connect these components, using only a high-level
description of the algorithm itself. Trading off a small reduction in power model accuracy
for a large increase in speed is one of the major contributions of this thesis.
In addition to the work on power consumption modelling, this thesis also develops a
new technique for selecting the appropriate word-lengths for an algorithm implementation
in order to minimise its cost in terms of power (or some other metric for which models
are available). The method developed is able to provide tight lower and upper bounds on
the optimal cost that can be obtained for a particular word-length optimisation problem
and can, as a result, find provably near-optimal solutions to word-length optimisation
problems without resorting to an NP-hard search of the design space.
Finally, the costs of systems optimised via the proposed technique are compared to
those obtainable by word-length optimisation for the minimisation of other metrics (such as
logic area), providing greater insight into the nature of word-length optimisation
problems and the extent of the improvements obtainable by solving them.
Acknowledgments
The work conducted for this thesis has been supervised by George A. Constantinides and
Peter Y. K. Cheung and funded by both the Engineering and Physical Sciences Research
Council (U.K.) and by Synplicity.
The supervision of George has been instrumental in focussing my concentration on
the key problems of my work, whilst also encouraging me to explore ideas and motivating
me to continue when faced with difficult problems. Our weekly meetings provided the
opportunity to discuss the (often mind-blowing!) details of many problems that have only
increased my interest in the areas explored by and related to those of this thesis. Peter’s
comments and enthusiasm for the field have also been of great help throughout this work,
and I must also thank him and George for providing me with the chance of working at
Synplicity during my research studies where I was able to spend time thinking about this
work in the context of the bigger picture and also working in related areas.
My parents David and Nieves Clarke have supported me throughout my educational
life and I am forever grateful to them for this and for everything they’ve done for me. I also
must thank my girlfriend Isobel for the great times we’ve had together and for listening to
the frightening intricacies of writing a thesis. I would like to thank all the great friends I’ve
had at Imperial over the years: the many flatmates I’ve had the pleasure of living with,
colleagues in the Circuits and Systems group, and the fun-loving gang from the Imperial
College Big Band and my quartet Twajazz. Finally I would like to thank the family dog
George whose love of steak is on a par with that of my supervisor George, and whose
antics continue to amuse me after 15 years.
Contents
Abstract 3
Acknowledgments 5
Contents 7
List of Figures 11
List of Tables 21
Abbreviations 23
Chapter 1. Introduction 25
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4 Statement of Originality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chapter 2. Background 31
2.1 Word-length optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.1 Fundamental difficulties in word-length optimisation . . . . . . . . . 32
2.1.2 Calculation of algorithm accuracy . . . . . . . . . . . . . . . . . . . 35
2.1.3 Calculation of cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.1.4 Word-length selection . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.2 The reconfigurable computing platform . . . . . . . . . . . . . . . . . . . . 62
2.2.1 FPGA construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.2.2 FPGA algorithm implementation . . . . . . . . . . . . . . . . . . . . 67
2.3 Estimation of power consumption of digital circuits . . . . . . . . . . . . . . 69
2.3.1 Power dissipation in digital CMOS circuits . . . . . . . . . . . . . . 70
2.3.2 Modelling of device power consumption . . . . . . . . . . . . . . . . 73
2.3.3 Summary of power consumption estimation . . . . . . . . . . . . . . 88
2.4 Summary of existing work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 3. Implementation details 91
3.1 Hardware and software used . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.1.1 Matlab and Simulink 2006a . . . . . . . . . . . . . . . . . . . . . . . 91
3.1.2 System Generator 9.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.1.3 Synplicity FPGA synthesis tools . . . . . . . . . . . . . . . . . . . . 93
3.1.4 Xilinx ISE 9.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.2 Construction of arithmetic components . . . . . . . . . . . . . . . . . . . . . 93
3.2.1 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.2.2 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 4. Models for power consumption in arithmetic components 99
4.1 Estimating the activity within adders . . . . . . . . . . . . . . . . . . . . . 101
4.1.1 Transition density model for Xilinx Adders . . . . . . . . . . . . . . 101
4.1.2 Incorporation of signal activity profile information . . . . . . . . . . 103
4.1.3 Model characterisation . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.1.4 Transition density model results . . . . . . . . . . . . . . . . . . . . 110
4.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2 Macro-model for power consumed in addition . . . . . . . . . . . . . . . . . 113
4.2.1 Relating power consumption to input word-length . . . . . . . . . . 113
4.2.2 Selection of macro-model statistical parameters . . . . . . . . . . . . 115
4.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.3 Macro-model for LUT Multipliers . . . . . . . . . . . . . . . . . . . . . . . . 122
4.3.1 Multiplier activity model . . . . . . . . . . . . . . . . . . . . . . . . 123
4.3.2 Relating power consumption to input word-length . . . . . . . . . . 129
4.3.3 Statistical parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.4 Macro-model for Embedded Multipliers . . . . . . . . . . . . . . . . . . . . 136
4.5 Components with input signals containing glitches . . . . . . . . . . . . . . 138
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Chapter 5. Models for power consumption in configurable routing 143
5.1 Capacitance estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.1.1 Parameters that affect net capacitance . . . . . . . . . . . . . . . . . 146
5.1.2 Enhancing early capacitance estimates . . . . . . . . . . . . . . . . . 152
5.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.2 Fan-out estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.2.1 Input fan-out for adders . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.2.2 Input fan-out for multipliers implemented in LUTs . . . . . . . . . . 162
5.2.3 Input fan-out for embedded multipliers . . . . . . . . . . . . . . . . 163
5.2.4 Fan-out to multiple components . . . . . . . . . . . . . . . . . . . . 163
5.3 Activity estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Chapter 6. Word-length selection using constrained optimisation 171
6.1 Sequential Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . 172
6.2 Word-length optimisation modifications for SQP . . . . . . . . . . . . . . . 174
6.2.1 Relaxation of integer constraints . . . . . . . . . . . . . . . . . . . . 175
6.2.2 Simplification of noise model to convex form . . . . . . . . . . . . . 177
6.2.3 Preventing zero-padding of signals . . . . . . . . . . . . . . . . . . . 180
6.2.4 Inter-component routing power convexity . . . . . . . . . . . . . . . 182
6.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.3.1 Optimal non integer improvements in area and power consumption . 188
6.3.2 Integer problem lower and upper bounds . . . . . . . . . . . . . . . . 197
6.3.3 Run times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.3.4 Power optimisation improvements in un-pipelined circuits . . . . . . 200
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Chapter 7. Conclusion 205
Bibliography 209
Appendix A. Diagram of SOS3 system 219
List of Figures
2.1 The interpretation of bits stored in fixed and floating-point format. Equa-
tions (2.2) and (2.3) show how these bits are interpreted as numerical values.
Each bit stores either the value 0 or 1. . . . . . . . . . . . . . . . . . . . . . 36
2.2 Three different types of distribution whose range the work in [KKS98]
attempts to estimate using statistical parameters. . . . . . . . . . . . . . . 39
2.3 (a) The basic logic blocks available in an FPGA, comprised of a 4-input
LUT, carry logic, and a register. (b) The contents of a Virtex II Pro SLICE. 65
2.4 (a) The contents of a Virtex II Pro CLB (b) A typical island style arrange-
ment of logic and routing in an FPGA. . . . . . . . . . . . . . . . . . . . . . 66
2.5 An illustration of the design flow for FPGAs. . . . . . . . . . . . . . . . . . 67
2.6 An inverter implemented in CMOS using a p-type transistor (top) and n-
type transistor (bottom). The p-type transistor is connected to the power
supply wire whose voltage is Vdd and the n-type transistor is connected to
ground. The inverter input and output voltages are Vin and Vout, respectively. 70
2.7 The model of the parasitic capacitance of a CMOS logic gate described in
[RP00]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.8 The typical profile of a Gaussian signal, with activity regions 1 and 3 from
the DBT model highlighted. The breakpoints between activity regions and
the level of activity in region 3, Tmsb, are also shown. Bit 19 is the MSB. . 82
3.1 (a) A full adder as implemented on a Xilinx FPGA. As shown, the XOR
of the inputs A and B is computed in a LUT, whilst the remainder of the
necessary logic is performed in the carry logic provided. (b) A chain of full
adders that form a ripple-carry adder. . . . . . . . . . . . . . . . . . . . . . 94
3.2 The above plot shows the area in SLICEs occupied by an adder on a Xilinx
Virtex II FPGA for input word-lengths between 2 and 32 bits. . . . . . . . 95
3.3 The aligned input word-lengths of input signals A and B of an adder. The
number of full adders required is adderwl, whilst the word-lengths and scal-
ings of signals A and B are nA, nB and pA, pB respectively. The triangle
represents the binary point and S represents the sign bit of each signal,
which in the case of signal A must be sign extended (shown by grey boxes).
As signal A has fewer LSB bits it alone determines the number of full-
adders needed to add the two signals, whilst the extra two LSB bits of B
are available ‘for free’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4 A parallel multiplier as implemented on a Xilinx FPGA. The PP blocks
are partial product generators that compute the result of multiplying the
multiplier signal x with each bit of the multiplicand signal y. In one chain
of SLICEs, two partial products can be generated using the LUTs and
then added using the SLICEs’ carry chain. Each carry chain of SLICEs is
depicted in the diagram using a grey box. . . . . . . . . . . . . . . . . . . . 97
3.5 For input word-lengths between 2 and 32 bits, the above plots show: (a)
the area in SLICEs of parallel multipliers implemented in LUTs, and, (b)
the area in SLICEs of embedded multipliers, counting both embedded mult
blocks and any additional LUTs required, using the embedded multiplier
to SLICE ratio in (3.1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.1 A full adder as implemented on a Xilinx FPGA. Signal Ln is the XOR of
the inputs An and Bn, computed in a LUT, whilst Cn−1, Cn and Sn are
the carry-in, carry-out and sum signals of this full adder, respectively. . . . 102
4.2 The output activity profile for a 32 bit adder, as estimated by the proposed
transition density-based method, before characterisation. The MSBs are
on the right-hand side of the x-axis. Input activity profiles are shown as
dashed lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3 The test circuit used to measure total adder output activity. . . . . . . . . . 109
4.4 The activity profile estimated using our model (before and after characteri-
sation), compared to the profile obtained through low-level simulation. The
adder input signal profiles are also shown as dashed lines. . . . . . . . . . . 111
4.5 The total output activity estimated by XPower compared to the total out-
put activity estimated by our model for a 32 bit adder, using 500 different
pairs of input signals. The dashed lines show ±10% relative error margins. . 111
4.6 The activity profile at the output of an adder as estimated using our model.
The adder input signal profiles are also shown as thin dashed lines. . . . . . 115
4.7 The output activity of an adder with input signals that are appropriately
scaled and that have a range of signal correlations, as estimated by the
model in Section 4.1. The input signal lag-1 autocorrelations change from
(a) to (c) as follows: in (a) ρa, ρb = 0.99, (b) ρa, ρb = 0, and (c) ρa, ρb =
−0.99. The different coloured lines show the output activities of the adder
under different values of word-level cross-correlation ρab between the adder
inputs: the red line denotes ρab = 0.99, green denotes ρab = 0, blue denotes
ρab = −0.99. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.8 Variation in the dynamic power consumption of a 16-bit adder obtained
through Xilinx XPower while the word-level cross-correlation ρab and lag-1
auto-correlation ρb are varied. The other signal statistics are held constant:
σa = 1.0, σb = 1.0, ρa = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.9 The estimated power consumption in various adders of word-lengths 4 to 32
bits, estimated by XPower (x-axis) and the proposed power macro-model
(y-axis). 150 tests involving random input statistics were performed. The
dashed lines show ±10% error margins. . . . . . . . . . . . . . . . . . . . . 123
4.10 The logic used to generate one bit of the sum of a pair of partial products,
and a carry-out. Within a LUT, the bits Xn, Xn−1 of the multiplier signal
X are multiplied with two bits Yn, Yn−1 of the multiplicand signal Y , to
create a pair of partial products that are summed. An extra AND gate and
the carry logic available within a SLICE are required to calculate the carry-
out from the sum of the pair of partial product bits. Two bits of the sum
of two partial products can be calculated in the two LUTs and associated
logic of a SLICE in a Xilinx FPGA in this way. . . . . . . . . . . . . . . . 124
4.11 The size, sign extension, and alignment of the partial products summed
in the adder tree of an 8 × 8-bit multiplier. The left-most diagram shows the
eight partial products generated, where each black dot represents one bit
in a partial product, and partial products are arranged in rows with their
MSBs on the left hand side. Adjacent pairs of partial products that are
not separated by a line are added together, where in each case the top-
most partial product needs to be sign extended (shown by a grey dot).
The middle diagram depicts the partial product sums that result from the
first figure that must also be sign-extended, aligned and added. Similarly
the right-most diagram shows the final addition that must be performed to
calculate the product of the inputs. . . . . . . . . . . . . . . . . . . . . . . . 127
4.12 Curves showing the activity estimated using the transition density based
adder model (dashed lines) and switch-level simulation using ModelSim and
XPower (solid lines) for the output of multipliers of various word-lengths,
driven by uncorrelated inputs. The input word-lengths used are as follows,
stated in the form multiplier × multiplicand: (a) 8 × 8, (b) 16 × 8, (c)
16× 16, and, (d) 32× 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.13 Variation in the dynamic power consumption in a 16 × 16-bit multiplier
while the word-level cross-correlation ρxy and auto-correlation in input y,
(i.e. ρy), are varied. The other signal statistics are held constant: σx = 1,
σy = 1, ρx = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.14 The estimated power consumption in various multipliers of word-lengths 4
to 32 bits, estimated by XPower (x-axis) and the proposed power macro-
model (y-axis). 100 tests involving random input statistics were performed.
The dashed lines show ±10% error margins. . . . . . . . . . . . . . . . . . . 134
4.15 The variation in power consumption of a multiplier implemented in LUTs
(y-axis) as the multiplier input word-length is changed (x-axis) for four
different multiplicand word-length values: 4, 10, 18 and 26. . . . . . . . . . 135
4.16 The power consumption of embedded multipliers of input word-lengths be-
tween 4 and 36 bits. Inputs with the same signal statistics are used through-
out, i.e. ‘well scaled’ uncorrelated Gaussian inputs. . . . . . . . . . . . . . . 137
4.17 The estimated power consumption in embedded multipliers of input word-
lengths from 4 to 32 bits, estimated by XPower (x-axis) and the proposed
power macro-model (y-axis). 100 tests involving random input statistics
were performed. The dashed lines show ±10% error margins. . . . . . . . . 138
4.18 The power consumption of the components in the system SOS3 LP. . . . . 142
5.1 From left to right, the information available about a circuit that is relevant
to capacitance estimation, at various stages in the FPGA design flow. i)
Before placement, component types and inter-component connections are
known. ii) After placement, component locations, net bounding boxes, and
the LUT pins used are known. iii) After routing, the wire segments
and switch box connections are known. . . . . . . . . . . . . . . . . . . . . 146
5.2 A net driven by component S, connected to two points in both components
D1 and D2 (Fan-Out of four). The dotted line is the Bounding Box (BB)
of the net. Wire-Length (WL) is the sum of the lengths of the net’s segments. 147
5.3 In (a), the RMSRE achieved when the methods shown were fitted to the
capacitance values from all the inter-routing wires extracted from the bench-
marks from Table 5.1 fitted to a Virtex II Pro device. In (b), a box and
whisker plot for each method of estimating capacitance shown in (a), where
the whiskers show the minimum and maximum Relative Errors in inter-
routing wire capacitance for each method, and the boxes show the upper
and lower quartiles and median Relative Errors in inter-routing wire capac-
itance for each method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.4 The RMSRE achieved when the methods shown were fitted to the
capacitance values from all the inter-routing wires extracted from the benchmarks
from Table 5.1, when the benchmarks were fitted to a Virtex 4 XC4VLX40
device. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.5 A simple netlist i) whose placement is approximated by the method de-
scribed in this section, resulting in a one-dimensional placement such as
the one depicted in ii). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.6 In (a), the RMSRE in capacitance for the inter-routing wires from Ta-
ble 5.1, averaged across 4 alternative placements of each circuit, using the
estimation techniques listed above. In (b), a box and whisker plot for each
method of estimating capacitance shown in (a), each plot showing the Relative
Error in capacitance for the inter-routing wires from Table 5.1 averaged
across 4 alternative placements of each circuit, where the whiskers show the
minimum and maximum Relative Errors and the boxes show the upper and
lower quartiles and median Relative Errors for each method. . . . . . . . . . 159
5.7 The Spearman’s rank correlation coefficients between the capacitances of
the wires in each benchmark and the estimates made by the models shown.
The corresponding P-values indicate the probability of achieving a partic-
ular level of correlation using a randomly selected ordering of nets. . . . . . 160
5.8 The diagram depicts a wire that fans-out from Adder1 to three other com-
ponents. The number of fan-outs for each bit of the wire that are within
Adder2, Mult1 and Mult2 can be determined for each component individu-
ally according to the structure of the component and its input word-lengths,
as is demonstrated in this section. . . . . . . . . . . . . . . . . . . . . . . . 162
5.9 A component with an output that branches out to three other compo-
nents is shown. Different input word-lengths are used for each component,
i.e. from top to bottom word-lengths of 3, 2, and 1, respectively. Accord-
ing to the input word-lengths used for each component, different output
bits will branch out to different components, which must be accounted for
correctly as described in this section. . . . . . . . . . . . . . . . . . . . . . . 164
5.10 The inter-routing power consumption of the components in the system SOS3
LP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.1 Noise injection schemes for truncating a signal with three fan-out branches
from word-length wl0 to word-lengths wl1, wl2 and wl3. The quantiser
blocks Q inject noise signals N1, N2 and N3 into the branches of the signal
as a result of truncation. These noise signals pass through combinations
of the transfer functions H1, H2 and H3, to the system output. Scheme
(a) correctly calculates the noise injected when wl1 > wl2 > wl3 but non-
convexity can occur if this order of word-lengths changes, as the destination
transfer functions also change. Scheme (b) only accounts for the truncation
noise common to H1, H2 and H3, but the optimal ordering of word-lengths
can be determined to make the scheme convex with the enforcement of
appropriate constraints as described below. . . . . . . . . . . . . . . . . . . 178
6.2 (a) The improvements in area when using the minimum area word-lengths
instead of the optimum uniform word-lengths. (b) The improvements in
dynamic power consumption when using the minimum power word-lengths
instead of the optimum uniform word-lengths. . . . . . . . . . . . . . . . . . 189
6.3 (a) The improvements in area when using the minimum area word-lengths
instead of the minimum sum of uniform word-lengths. (b) The improve-
ments in dynamic power consumption when using the minimum sum of
uniform word-lengths instead of the minimum power word-lengths. . . . . . 191
6.4 The optimum cost points of several 2-signal systems with word-lengths wlx
and wly. Each plot shows the cost of the system (y-axis) as wlx (x-axis)
and wly (not shown) are changed whilst meeting a fixed noise constraint.
Each system uses a different cost function as shown below each plot. The
optimum point for system (a) is marked by a circle, and the same point is
also marked by a circle in plots (b)-(d). The crosses in plots (b)-(d) mark
the optimum points for each of those plots. The values of wlx and wly at
each optimum point are shown in brackets. . . . . . . . . . . . . . . . . . . 192
6.5 (a) The improvements in area when using the minimum area word-lengths
instead of the minimum power word-lengths. (b) The increase in dynamic
power consumption when using the minimum power word-lengths instead
of the minimum area word-lengths. . . . . . . . . . . . . . . . . . . . . . . . 195
6.6 The percentage increase in area of the integer word-length upper-bound and
heuristic solution, relative to the lower-bound given by the minimum area
non integer word-lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.7 The percentage increase in power consumption of the integer word-length
upper-bound and heuristic solution, relative to the lower-bound given by
the minimum power non integer word-lengths. . . . . . . . . . . . . . . . . . 198
6.8 The time required to run the non integer word-length optimisation procedures. 199
6.9 An un-pipelined path in a circuit. The increased power consumption as a
result of glitches passed between components is modelled by doubling the
inter-routing power of Mult1, doubling the internal power of Adder1, qua-
drupling the inter-routing power of Adder1, and quadrupling the internal
power of Mult2. Register2 is incorporated into the output of Mult2 and
hence the inter-routing power of Mult2 is not increased. . . . . . . . . . . . 201
6.10 The non integer word-length optimisation for power minimisation improve-
ments achieved compared to the optimisations shown when un-registered
components are modelled. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
A.1 The System Generator block diagram representing the system SOS3 LP: a
low-pass IIR filter organised as three second order sections. . . . . . . . . . 220
List of Tables
2.1 A comparison of range analysis techniques. . . . . . . . . . . . . . . . . . . 59
2.2 A comparison of quantisation noise analysis techniques. . . . . . . . . . . . 60
2.3 A comparison of word-length selection techniques. . . . . . . . . . . . . . . 61
2.4 A comparison of power estimation techniques. . . . . . . . . . . . . . . . . . 90
5.1 Benchmark Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.1 Benchmark Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Abbreviations
ASIC: Application Specific Integrated Circuit
CAD: Computer Aided Design
CLB: Configurable Logic Block
CMOS: Complementary Metal Oxide Semiconductor
DBT: Dual Bit Type
DSP: Digital Signal Processing
FIR: Finite Impulse Response
FPGA: Field Programmable Gate Array
Hz: Hertz
IIR: Infinite Impulse Response
ILP: Integer Linear Programming
LMS: Least Mean Squared
LP: Linear Programming
LSB: Least Significant Bit
LTI: Linear Time Invariant
LUT: Look-Up Table
MILP: Mixed Integer Linear Program
MSB: Most Significant Bit
NMOS: N-type Metal Oxide Semiconductor
PDF: Probability Distribution Function
PE: Polynomial Evaluator
PMOS: P-type Metal Oxide Semiconductor
RAM: Random Access Memory
RMSRE: Root Mean Squared Relative Error
RTL: Register Transfer Level
SNR: Signal to Noise Ratio
SOS: Second Order Section
SQP: Sequential Quadratic Programming
VHDL: VHSIC Hardware Description Language
VPR: Virtual Place and Route
W: Watts
Chapter 1
Introduction
1.1 Motivation
The computation of arithmetic operations by both humans and machines is typically
performed using low levels of finite precision, as time, physical space and effort are required
to process every digit when a number must be computed, stored or written down, and the
costs of these resources prevent the use of full precision.
Unfortunately a lack of precision can lead to incorrect results which can render the
performance of any arithmetic operation useless: hence infinite precision, or at least as
much precision as possible, is preferable when storing any number or computing the result
of any mathematical operation. In order to achieve accurate results in useful periods
of time without wasting available space and energy it is necessary to strike a balance
between the numerical precision used and the resulting accuracy of a computation. For
computations involving variables that are each represented by a number of symbols that
correspond to the variable’s value, word-length optimisation is the name of the process of
performing a trade-off between the precision or word-length (i.e. number of symbols) used
for variables and the accuracy of computation.
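The trade-off described above can be illustrated with a minimal sketch. It assumes a fixed-point representation in which a value is truncated to a chosen number of fractional bits; the function names `quantise` and `truncation_error` are hypothetical and are not taken from the tools used in this thesis. Each extra bit of word-length costs storage and logic, but the truncation error stays below 2^-n for n fractional bits.

```python
import math


def quantise(value: float, frac_bits: int) -> float:
    """Truncate `value` to a fixed-point number with `frac_bits` fractional bits.

    Illustrative assumption: truncation (rounding towards minus infinity),
    with no limit on the number of integer bits.
    """
    scale = 2 ** frac_bits
    return math.floor(value * scale) / scale


def truncation_error(value: float, frac_bits: int) -> float:
    """Absolute error introduced by truncating `value` to `frac_bits` bits."""
    return abs(value - quantise(value, frac_bits))


if __name__ == "__main__":
    x = math.pi / 4  # arbitrary test value
    for bits in (4, 8, 12, 16):
        # Longer word-lengths give smaller errors at a higher storage cost.
        print(bits, quantise(x, bits), truncation_error(x, bits))
```

Running the script prints the truncated value and its error for each word-length, showing the error shrinking as more fractional bits are used.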
Word-length optimisation forms part of an ever increasing array of tools available
for automated algorithm optimisation in the face of rapidly increasing design complexity.
These tools are necessary because, since their inception, integrated circuits that facilitate
the performance of computations on vast amounts of data have been becoming more
complex at an exponential rate, as described by Moore’s Law [Moo65]. The rapidly increasing
complexity of computing devices means that they are becoming more and more powerful
and hence able to solve larger and more complex problems, but paradoxically their com-
plexity means it is becoming more and more difficult to exploit the computational power
available in these devices.
In order to combat this situation, program designers require tools to aid them in
writing programs that exploit the capabilities of target computation devices in order to
achieve maximum performance, in spite of the complexities of the devices themselves.
Designers are faced with a plethora of options over the amount of precision to use
for the performance of arithmetic operations in order to achieve a required level of
accuracy, and until the advent of the first automatic techniques for word-length optimisation
these options had to be explored through trial and error, if at all. Automatic
word-length optimisation can therefore reduce the complexity of designing programs for
modern computation devices by reducing or removing the need for designers to evaluate
and account for the computational accuracy of their programs.
Moreover, computing devices take time to calculate a result, require space for elec-
tronic hardware to compute a result, and consume energy in the form of electricity in order
to drive signals through a circuit and to perform logic operations on these signals. As de-
scribed above, there is an intrinsic link between each of these costs and the word-lengths
or precision used to perform computation.
While the speed of and area required by computing devices have long been the sub-
ject of much industrial and academic interest, the power consumed by them has not been
perceived to be so significant until more recently, with the advent of portable devices that
have limited battery life and the development of high-performance devices whose power
consumption prevents them from being cost effective due to the cost of electricity, power
distribution and the prevention of overheating or device failure due to power dissipation
[RJD98; Pig04]. Hence high power consumption is a newer problem for which designers
require additional aid in order to manage the complexity of creating programs that can
exploit the capabilities of computing devices.
Unfortunately the computational complexity involved in performing automatic
word-length optimisation is high, as the optimal solution to a problem is obtained from
only one particular combination of variable word-lengths among a vast number of other
possible word-length combinations in the design space. Previous work in the field of au-
tomatic word-length optimisation has concentrated on the use of heuristic methods for
finding ‘good’ solutions to the problem, but the quality of these solutions is not guaran-
teed. Thus one of the primary objectives of the work in this thesis is the development of a
new method for automatically performing the word-length optimisation trade-off and for
providing tight bounds on optimal solutions to the problem.
However, due to the enormous array of application domains and of computing devices
that have been developed for them, it is beyond the scope of this thesis to develop an
automatic word-length optimisation tool that optimises the power consumption of any
algorithm intended for any application domain, implemented on any computing device.
This is because every application domain may have different notions of what it is that
makes a result accurate, and because there exist several different classes of algorithm
that each have different limits on what can be reasoned about their behaviour without
execution of the algorithm itself. As will be seen, these factors are critical to the nature
of a particular word-length optimisation problem.
The objective of this thesis, then, is to describe an automatic word-length optimisation
technique that has been developed to trade off the power consumed by a reconfigurable
computing device against the accuracy of results generated by a Digital Signal Processing
(DSP) algorithm running on that device. DSP algorithms intended for real-time appli-
cations in areas such as video processing or data analysis are very computationally de-
manding, but fortunately they can be highly parallelised in order to take advantage of the
computational power available on reconfigurable devices. However, significant designer
effort is required to create the highly customised designs necessary to exploit the high
performance features available on these devices.
Application Specific Integrated Circuits (ASICs) allow similar or even greater fine
grain control over algorithm implementations; however, due to their extremely high cost
(the mask set for a 90nm process costs in excess of $1 million [Zah03]) they are too expen-
sive for use in any other than high-volume applications. Reconfigurable computing devices
on the other hand are much more cost-effective for low-volume applications but are not as
efficient as ASICs: according to a comparison between Field Programmable Gate Arrays
(FPGAs) and standard cell ASICs in [KR06] FPGAs are 3.2 times slower, use 40 times
more silicon area to implement the same functionality and have 12 times higher levels of
power consumption on average. Notwithstanding this, reconfigurable devices like FPGAs offer
the possibility of high-performance implementations of a myriad of applications that stand
to benefit significantly from Computer Aided Design (CAD) tools that allow designers to
maximise performance, despite the complexity of the devices at hand, in order to bring
such devices closer to higher performance ASICs.
Hence the work presented in this thesis is necessarily focused on a particular word-
length optimisation problem. Nevertheless, not only does the work presented here extend
the knowledge and understanding of word-length optimisation and methods for solving
such problems automatically, but it also develops new methods for estimating the power
consumption in a particular class of device in order to guide automatic word-length opti-
misation.
1.2 Objectives
The objective of this thesis is to provide a new method to aid the design automation of
FPGA implementations of DSP algorithms by automatically selecting the signal word-
lengths necessary to minimise the power consumption of these implementations. This
thesis develops new techniques to facilitate the location of optimal solutions to word-length
optimisation problems, by providing tight bounds on their cost, as well as new techniques
to allow the fast estimation of the power consumption of FPGA implementations of DSP
algorithms during the search of the word-length optimisation design space.
1.3 Overview
The contents of this thesis are organised as follows. Chapter 2 summarises the complexity
of both word-length optimisation and of the estimation of the power consumption of FPGA
devices, before going into detail on existing work in these fields. Chapter 3 provides
a summary of relevant aspects of the FPGA devices targeted in this work and of the
software and hardware tools used to provide the results of this thesis.
Chapter 4 describes a novel set of models that allow the fast estimation of the
power consumed in the arithmetic operations of a DSP algorithm. In order to do this,
new techniques for analysing the way in which power is consumed in these arithmetic
operations are developed so that the minimum set of parameters required to construct
fast and accurate power models can be determined.
In Chapter 5, a method for quickly estimating the power consumed in the wires
that communicate data between arithmetic components is described. This problem is
particularly difficult as there is very little information available on the construction of
these wires unless the entire circuit is fully elaborated, placed, and routed.
A new method for performing automatic word-length optimisation is described in
Chapter 6 that is able to provide tight bounds on optimal solutions to instances of the
problem. The power consumption models developed in Chapters 4 and 5 are then used
with this new word-length optimisation technique to minimise the power consumed in a
set of benchmark circuits and to then analyse the benefits obtained by using the techniques
described in this thesis.
The thesis is concluded in Chapter 7 with a summary of its key points and contribu-
tions, as well as a description of areas for future research resulting from the developments
made in this work.
1.4 Statement of Originality
The original contributions of the work in this thesis to the fields of word-length optimisa-
tion and power consumption estimation are listed below. The research conducted for this
thesis has resulted in several publications that are indicated along with each contribution.
Additional and more detailed explanations of these contributions are given in the chapters
of this thesis as indicated.
• The development of a set of power consumption macro-models for arithmetic com-
ponents whose form is established through detailed analysis of the structure and
operation of these components (Chapter 4, [CGC05; CGCC06; CCCS08]).
• A novel method for estimating the power consumed in the wires that connect arith-
metic components, without the need for placement or routing of a circuit (Chapter 5,
[CCC07]).
• A method for providing tight lower and upper bounds to optimal solutions for word-
length optimisation problems in order to guarantee near-optimal solutions, and the
first results on multiple word-length optimisation for power consumption minimi-
sation (Chapter 6, submitted to the ACM Transactions on Design Automation of
Electronic Systems [CCC08]).
Chapter 2
Background
This chapter provides an overview of existing work in the subject areas that are brought
together in the work described in this thesis, namely word-length optimisation and the
modelling of power consumption of digital electronic devices. A description of the con-
cept and complexities of word-length optimisation is given in Section 2.1 in order to give
context to the existing work that is then summarised. Section 2.2 then summarises the
reconfigurable computing platform which is intended to implement the high throughput
real-time DSP algorithms that are the target of word-length optimisation in this work.
Section 2.3 then begins with a summary of the processes through which power is dissipated
in digital CMOS circuits such as reconfigurable devices, before presenting existing work
that uses knowledge of these processes to estimate the power consumed in reconfigurable
hardware.
2.1 Word-length optimisation
Word-length optimisation is the name given to the process of selecting the word-lengths
of the signals or variables that communicate data between operations in an algorithm,
whilst ensuring results calculated by the algorithm are accurate. The word-lengths used
have an effect on several costs of a circuit or program such as circuit speed, area or power
consumption, and a word-length optimisation procedure can aim to minimise one or more
of these costs through the selection of appropriate word-lengths.
Although the concept itself may seem straightforward, the reality is that automatic
word-length optimisation is a ‘difficult’ problem to solve. There are a variety of possible
algorithms, application domains and computing devices to which automatic word-length
optimisation can be applied, but it is still possible to elaborate on this difficulty by making
some observations about the general nature of word-length optimisation problems as is
described in the following section. Previous work on word-length optimisation will then
be discussed and compared to these fundamental difficulties associated with word-length
optimisation in Sections 2.1.2, 2.1.3 and 2.1.4.
2.1.1 Fundamental difficulties in word-length optimisation
Assuming that it is possible to quantitatively express the accuracy A of the output of
an algorithm implementation and some cost C of that implementation that we wish to
optimise, then the resulting word-length optimisation can be formalised as an optimisation
problem as follows:
minimise C(w)
subject to: A(w) ≥ α
(2.1)
where w is a vector representing the word-lengths of the algorithm and α is a numerical
value expressing some accuracy level that we wish the optimised algorithm to achieve.
The optimisation in (2.1) expresses the problem “find the vector of word-lengths w that
minimises the value of C(w) whilst ensuring that A(w) ≥ α is satisfied”. Note that
the nature of the functions A(w) and C(w) is specific to the word-length optimisation
problem at hand, i.e. the type of algorithm to be optimised, the way in which accuracy
is measured for this application, and the type of device for which the algorithm is being
optimised.
In order to express what it is that the word-lengths w of the system represent it
is necessary to explain what is meant by an algorithm. According to the definition in
[Knu97a], an algorithm is “a finite set of rules that gives a sequence of operations for
solving a specific type of problem”. Let us assume then that the result of each operation
in an algorithm is a numerical value that is either communicated to later operations in
the sequence that require this value, or is the result of the algorithm. In the process
of communicating a numerical value that value is stored in some form which we shall
assume to be a collection of symbols that relate in some way to the numerical value. Let
us denote the word-length w_ij as the number of symbols that are used to communicate
a value between a pair of operations i and j. We can now state that wherever data is
transferred between a pair of operations in an algorithm there exists a word-length that
is one of the elements in the vector w that represents the number of symbols used to
communicate values between that pair of operations.
The larger the number of symbols that are used to communicate a value, the greater
the precision with which that value can be expressed. It is important to observe then
that an integral number of symbols are used to communicate numerical values between
operations in an algorithm.
Hence all the word-lengths in the vector w used in an algorithm must be integers.
Optimisation problems such as (2.1) that involve only integer variables are called com-
binatorial optimisation problems and are often NP-hard problems [KV05], meaning that
the computational effort required to find the optimal solution to such problems rises ex-
ponentially, in practice, as the size of the problem increases. An insight into the nature of
this complexity is given by the following discussion. Firstly, let us denote n as the number
of data communications between operations in an algorithm, in which case size(w) = n
and so the word-length optimisation problem at hand has n dimensions. Furthermore, let
us assume that it is known that for a given accuracy constraint A(w) ≥ α, the maximum
word-length that will be required for any data communication in the algorithm is wl_max.
We can thus establish that there are (wl_max)^n possible combinations of word-lengths w in
the design space. Unfortunately the number of combinations of w that must be examined
in order to determine the optimum solution in the design space increases exponentially
with n, despite the fact that techniques exist that do not require all (wl_max)^n design
options to be examined (such as the branch and bound technique summarised in [KV05]).
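To make the size of this design space concrete, the following minimal sketch enumerates every word-length vector for a toy instance of (2.1). The functions `cost` and `accuracy` are hypothetical stand-ins chosen purely for illustration (cost as the total number of bits, accuracy as the negated sum of noise powers 2^(-2w) per signal); they are not the models developed in this thesis.

```python
from itertools import product

def cost(w):
    # Hypothetical cost model C(w): total number of bits used.
    return sum(w)

def accuracy(w):
    # Hypothetical accuracy model A(w): negated total quantisation noise
    # power, assuming each signal contributes noise proportional to 2^(-2*w_i).
    return -sum(2.0 ** (-2 * wi) for wi in w)

def exhaustive_search(n, wl_max, alpha):
    """Examine all (wl_max)^n word-length vectors.

    Returns the cheapest vector satisfying A(w) >= alpha and the number of
    vectors that had to be evaluated.
    """
    best, evaluated = None, 0
    for w in product(range(1, wl_max + 1), repeat=n):
        evaluated += 1
        if accuracy(w) >= alpha and (best is None or cost(w) < cost(best)):
            best = w
    return best, evaluated

best, evaluated = exhaustive_search(n=3, wl_max=8, alpha=-0.01)
print(best, cost(best), evaluated)
```

Even for this tiny example with n = 3 and wl_max = 8, all 8^3 = 512 vectors are visited; adding one more signal multiplies the work by wl_max, which is why exhaustive enumeration is only viable for the smallest problems.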
Despite the potential complexity of word-length optimisation problems, only one
particular type of word-length optimisation problem (i.e. one with particular functions
A(w) and C(w)) has been proven to lie in the NP-hard category of algorithm complexity:
the one described in [CW02]. Other word-length optimisation problems imply the need
for different functions A(w) and C(w), so it is not clear that all word-length optimisation
problems are NP-hard. However, the word-length optimisation problem described in [CW02]
is much simpler than the problems approached in other existing work, and only
special simple cases of problems with integer unknowns are known not to be NP-hard.
No word-length optimisation problem that has been studied in previous work has been
shown to be optimally solvable in polynomial time.
As a result of the complexity of finding an optimal solution to word-length optimi-
sation problems, previous work has concentrated on the use of heuristics that search the
available design space for a particular problem in order to find a solution that is hopefully
close to optimal.
The complexity of the methods for finding solutions to word-length optimisation
also places important requirements on the functions A(w) and C(w) as follows. Despite
the enormous design space to be searched, it may be possible to find optimal or near
optimal solutions for non-trivial problems if A(w) and C(w) can be evaluated extremely
quickly for any particular w. If not then none but the smallest of problems can be solved
optimally. This has implications on the range of algorithms and target devices that word-
length optimisation can be applied to, as the accuracies and costs of some are easier to
estimate than others.
The above forms a summary of the difficulties in performing automatic word-length
optimisation for an algorithm that should be borne in mind when studying the merits or
otherwise of any particular technique. Existing work on word-length optimisation is in-
troduced in the sections below as follows: Sections 2.1.2 and 2.1.3 describe methods for
quickly estimating the computational accuracy and cost of an algorithm implementation,
given specific word-lengths w, whilst Section 2.1.4 describes methods for finding ‘good’
(i.e. optimal or near-optimal) solutions to particular word-length optimisation problems.
Every technique will be discussed and compared with reference to the fundamental diffi-
culties in word-length optimisation highlighted above.
2.1.2 Calculation of algorithm accuracy
As was noted in the preceding section, the size of the search required to find ‘good’
solutions to word-length optimisation problems is so large that we cannot hope to find
such solutions to any but the most trivial of problems unless the accuracy and cost of
algorithm implementations can be estimated extremely quickly. This section deals with
existing methods that have been used to determine the accuracy of algorithms quickly
during word-length optimisation.
There are two critical factors that determine the nature of calculating the accuracy
of an algorithm: i) the scheme used to communicate numbers between operations, and ii)
the nature of the algorithm itself.
Fixed-point and floating-point are the two number representations that are most
commonly used for computation in modern computing. The difference between these two
representations is that in fixed-point, a fixed number of bits (binary digits) both to the
left and to the right of the binary point are stored, whilst in floating point a fixed number
of bits (called the significand) are multiplied by a base that is raised to an exponent
determined by a number of other bits called the exponent bits. Figure 2.1 depicts the
way in which the bits of these representations are organised and (2.2) and (2.3) show how
numerical values are interpreted from these bits, for fixed-point (where fj is the value of
the jth bit), and floating-point (where fj is the value of the jth bit of the significand and
xj is the value of the jth bit of the exponent), respectively.
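$$ \text{fixed-point number} = 2^{p-n}\left(-f_{n-1}2^{n-1} + \sum_{i=0}^{n-2} f_i 2^i\right) \tag{2.2} $$
$$ \text{floating-point number} = \pm\left(1 + \sum_{i=0}^{n_f-1} f_i 2^{i-n_f}\right) 2^{\left(-x_{n_x-1}2^{n_x-1} + \sum_{j=0}^{n_x-2} x_j 2^j\right)} \tag{2.3} $$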
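As an illustration, the interpretations in (2.2) and (2.3) can be sketched as two small functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, bits are passed as Python lists with the least significant bit first (so `bits[i]` is f_i or x_i), and the floating-point sign is passed as a separate flag.

```python
def fixed_point_value(bits, p):
    """Value of a two's complement fixed-point word as in (2.2).

    bits[i] is f_i (least significant bit first); p is the scaling.
    """
    n = len(bits)
    integer = -bits[n - 1] * 2 ** (n - 1) + sum(bits[i] * 2 ** i for i in range(n - 1))
    return 2.0 ** (p - n) * integer

def floating_point_value(negative, sig_bits, exp_bits):
    """Value of a floating-point word as in (2.3).

    sig_bits[i] is f_i and exp_bits[j] is x_j (least significant bit first);
    the exponent is read as a two's complement integer.
    """
    nf, nx = len(sig_bits), len(exp_bits)
    significand = 1 + sum(sig_bits[i] * 2.0 ** (i - nf) for i in range(nf))
    exponent = -exp_bits[nx - 1] * 2 ** (nx - 1) + sum(exp_bits[j] * 2 ** j for j in range(nx - 1))
    return (-1 if negative else 1) * significand * 2.0 ** exponent

# 0110.0000 with n = 8 and p = 4 integer bits
print(fixed_point_value([0, 0, 0, 0, 0, 1, 1, 0], p=4))                  # 6.0
# significand 1.5 (only f_7 set), exponent 0011 = 3
print(floating_point_value(False, [0, 0, 0, 0, 0, 0, 0, 1], [1, 1, 0, 0]))  # 12.0
```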
Figure 2.1: The interpretation of bits stored in fixed and floating-point format. Equations (2.2) and (2.3) show how these bits are interpreted as numerical values. Each bit stores either the value 0 or 1. [Diagram: a fixed-point word (bits f0 to f7; sign bit, integer bits and fraction bits; word-length n, scaling p) and a floating-point word (sign +/-; exponent bits x0 to x3, word-length nx; significand bits f0 to f7, word-length nf).]
Three types of errors can arise when storing numbers in both fixed and floating-point;
these are: i) overflow errors where the absolute value of a number is too large to
be stored in the given representation, ii) quantisation errors where there are insufficient
fraction bits (in fixed-point) or significand bits (in floating-point) to store all of the frac-
tional part of a number, and iii) underflow errors that can occur in floating point when a
number has a smaller magnitude than the smallest quantity representable.
Whilst quantisation error generally results in a small loss of accuracy, overflow
errors can be devastating as they can cause a stored result to ‘wrap-around’ i.e. a large
positive number would overflow and become a large negative number, or vice-versa, in two’s
complement (and in floating point the exponent can also ‘wrap-around’). As a result even
a single overflow error can cause complete loss of accuracy in a calculation; however, an
accumulation of many small quantisation errors in the operations of an algorithm will also
cause a substantial loss of accuracy.
The different nature in which overflow and quantisation errors arise and the signif-
icantly different effects on accuracy that they have necessitates two different approaches
to mitigating their effects during word-length optimisation. It is generally best to ensure
that overflow either never occurs or only occurs with very small probability, whilst quantisation
noise arising at the output of many operations is permissible under the condition
that the total loss in accuracy is not so large as to prevent accuracy constraints placed
on the word-length optimisation problem from being met. The following subsections deal
separately with existing methods for dealing with overflow and quantisation errors.
Overflow prevention: range analysis
If the range of values that the output of some operation may take is known then the
minimum number of bits required to store that range of values can be easily determined.
The minimum number of bits required to store the range of values taken by a particular
signal is typically called that signal’s scaling. Range analysis is the process of determining
the ranges of the outputs of all operations in an algorithm and various methods have been
used to do this in existing work on word-length optimisation, as listed below and explained
in the following.
• Simulation-based [SK95; CSPL01; RB05; BR05; GMLC04; Con03]
• Combination of simulation and statistics [KKS98; SB04; HE06; ONG04]
• Range propagation [BP00; WP98; SBA00; NHCB01; CRS+99]
• Affine arithmetic [ACS94; LGC+06]
• l1 scaling [CCL01; CT06]
Simulation-based. If test vectors are available that can be used to run a simulation (or
several simulations) of the algorithm that is to be optimised, then the range of values that
the outputs of each operation in an algorithm can take can be determined directly from
these simulations. In two’s complement fixed point representation, the number of integer
bits nint (i.e. to the left of the binary point) required to represent a value of magnitude
m is given by:
$$ n_{int} = \lfloor \log_2(m) \rfloor + 1 \tag{2.4} $$
However, unless every possible input sequence is used in the set of test vectors
to simulate the program, it will only be possible to explore a subset of all the possible
behaviours of the algorithm and the resulting range of values generated by operations
within the algorithm. One approach available to combat the lack of complete simulation
data is to increase the value of nint by some positive number of bits called guard bits.
Doing so will mean that overflow will hopefully be avoided in situations that may arise
that were either not conceived by the algorithm designers or for which there are no test
vectors available.
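The simulation-based approach can be sketched as follows: the peak magnitude observed for a signal is converted to a number of integer bits using (2.4), and an optional number of guard bits is added. The function names and the sample values are hypothetical and serve only as an illustration.

```python
import math

def integer_bits(m):
    """Equation (2.4): integer bits needed for a value of magnitude m > 0."""
    return math.floor(math.log2(m)) + 1

def simulated_integer_bits(samples, guard_bits=0):
    """Integer bits for a signal, from its largest simulated magnitude.

    guard_bits widens the range to cover behaviour not exercised by the
    available test vectors.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return guard_bits  # signal never left zero during simulation
    return integer_bits(peak) + guard_bits

trace = [0.5, -3.2, 6.9, -1.0]                    # hypothetical simulated values of one signal
print(simulated_integer_bits(trace))                # 3
print(simulated_integer_bits(trace, guard_bits=1))  # 4
```

Each guard bit doubles the representable range, so the choice of how many to add is a trade-off between confidence in the test vectors and the cost of the extra bits.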
Clearly algorithm designers cannot hope to even identify every possible combination
and sequence of inputs that may be passed to their algorithm, let alone simulate the
outcome of the algorithm under every such possible sequence. As a result, although
the use of simulation as described above is common in word-length optimisation work
[SK95; CSPL01; RB05; BR05; GMLC04; Con03], there has been significant investigation
into techniques that can either calculate or estimate the range of values that the output of
operations in an algorithm may take without the need for test vectors that explore every
possible sequence of inputs that may drive an algorithm. Some of these methods are based
on using short simulations in combination with statistical techniques, as described below,
whilst yet others use analytical methods as described later within this subsection.
Combination of simulation and statistics. Methods that fall into this category use
statistical parameters measured from simulation to estimate the range of values that the
output of each operation in an algorithm may take. The main advantage of these methods
is that it may be possible to predict these ranges with a quantifiable level of confidence,
however these methods normally assume that the outputs of these operations can be mod-
elled by particular probability distributions, which may or may not be the case depending
on the inputs to and nature of the algorithm itself.
Work in [KKS98] first proposed such a technique, similar forms of which have
since been used in [SB04; HE06]. In [KKS98] the authors state that in signal processing
algorithms (to which the work is applied), simple probability distributions such as the
Gaussian distribution can typically be used to approximate the distribution of values at
the output of operations. For a Gaussian distribution, values within a range of four times
the standard deviation about the mean of the distribution occur 99.99% of the time. Hence
overflow will be avoided 99.99% of the time if the maximum value M that is expected to
occur for the output of a particular operation with mean µ and standard deviation σ is:
M = |µ|+ 4σ (2.5)
Figure 2.2: Three different types of distributions, the range of which the work in [KKS98] attempts to estimate using statistical parameters: i) Gaussian distribution, ii) skewed distribution, iii) multi-modal distribution. [Each panel plots pdf(x) against x for −10 ≤ x ≤ 10.]
However the authors acknowledge that not all signals in practical systems can be
modelled in this way, and hence they make use of other statistical parameters in order to
detect and model the range of signals with more complex distributions, namely ones that
are skewed or that are multi-modal (i.e. having more than one peak in the distribution).
Figure 2.2 shows an example of both a skewed and a multi-modal distribution in plots ii)
and iii) respectively, the range of which would be difficult to model using the mean and
variance of a Gaussian distribution such as the one shown in plot i). In order to be able
to better model the range of distributions such as ii) and iii) in Figure 2.2 the authors
adopt the following approach.
To detect skewed or multi-modal distributions, the skewness and kurtosis of each
signal are measured. Non-zero skewness indicates that a distribution spreads more widely
to the left or right of its mean, whilst kurtosis indicates how many of a distribution’s
samples are close to the mean and becomes zero if the distribution is Gaussian. Skewed
distributions can thus be easily detected if the skewness of a distribution is measured as
non-zero, whilst the authors propose that multi-modal distributions can be detected if the
kurtosis k of a distribution lies outside of the range −1.2 < k < 5. For detected skewed
or multi-modal distributions then, the authors propose that the maximum value M that
requires representation can be estimated according to:
$$ M = M_{99.9\%} + (M_{100\%} - M_{99.9\%}) \times 2 \tag{2.6} $$
where M_{P%} represents the value that covers P% of the entire distribution. The motivation
behind the use of this formula is that the greater the difference between M_{100%} and
M_{99.9%} (i.e. the greater the difference between the largest value measured and the one that
covers 99.9% of the measured distribution), the greater the maximum value that must be
represented.
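The selection between (2.5) and (2.6) can be sketched as follows. This is a minimal illustration rather than the implementation of [KKS98]: the function names are hypothetical, the value covering P% of the distribution is assumed to be the corresponding order statistic of the sample magnitudes, and, because measured skewness is never exactly zero, a tolerance `skew_tolerance` is assumed here for the skewness test.

```python
import math
import statistics

def shape_statistics(samples):
    """Mean, standard deviation, skewness and excess kurtosis of the samples.

    Assumes the signal is not constant (non-zero standard deviation).
    """
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    z = [(x - mu) / sigma for x in samples]
    skewness = statistics.fmean([v ** 3 for v in z])
    kurtosis = statistics.fmean([v ** 4 for v in z]) - 3.0  # zero for a Gaussian
    return mu, sigma, skewness, kurtosis

def covering_magnitude(samples, percent):
    """Assumed reading of M_P%: smallest magnitude covering percent % of the samples."""
    magnitudes = sorted(abs(x) for x in samples)
    count = math.ceil(round(percent * len(magnitudes) / 100.0, 9))
    return magnitudes[max(count, 1) - 1]

def estimate_maximum(samples, skew_tolerance=0.5):
    """Estimate the maximum value M that must be representable for a signal."""
    mu, sigma, skewness, kurtosis = shape_statistics(samples)
    if abs(skewness) > skew_tolerance or not (-1.2 < kurtosis < 5):
        # Skewed or multi-modal distribution detected: equation (2.6)
        m_999 = covering_magnitude(samples, 99.9)
        m_100 = covering_magnitude(samples, 100.0)
        return m_999 + (m_100 - m_999) * 2
    # Approximately Gaussian distribution: equation (2.5)
    return abs(mu) + 4 * sigma

bell_shaped = [-2] + [-1] * 4 + [0] * 6 + [1] * 4 + [2]
skewed = [0] * 990 + [1] * 9 + [10]
print(estimate_maximum(bell_shaped))  # 4.0
print(estimate_maximum(skewed))       # 19
```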
Clearly the work in [KKS98] is based on empirical observations of a number of
signals from signal processing circuits. In [ONG04] an alternative range analysis method
based on the use of statistics is proposed that uses what is known as extreme value theory.
This is a branch of statistics that deals with extreme deviations from the median value of
probability distributions, where these extreme values can be modelled using one of only a
few different probability distributions [KN00].
In [ONG04] an algorithm must be simulated n times, with different input vectors
containing random values given each time. Each time the program is simulated the extreme
value (i.e. the value of largest magnitude) taken at the output of every operation is found,
thus after n simulations, n extreme values for each signal have been measured.
For each signal an extreme value distribution called the Gumbel distribution is
then fitted to the n measured extreme values from the n simulations. Each fitted Gumbel
distribution can then be used to estimate a range that should be used to ensure that a
chosen percentage of the values taken by the variable can be stored without overflow. The
greater the number of simulations n, the greater the confidence in the fit of the Gumbel
distribution and the extreme values predicted by it.
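A minimal sketch of this procedure is given below. The fitting method used in [ONG04] is not described here, so a simple method-of-moments fit is assumed; `simulate_extreme` is a hypothetical stand-in for one simulation run of the algorithm, and all function names are illustrative.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(extremes):
    """Method-of-moments fit of a Gumbel distribution (assumed fitting method).

    Returns (location, scale).
    """
    scale = statistics.stdev(extremes) * math.sqrt(6) / math.pi
    location = statistics.fmean(extremes) - EULER_GAMMA * scale
    return location, scale

def gumbel_quantile(location, scale, p):
    """Value that is not exceeded with probability p under the fitted distribution."""
    return location - scale * math.log(-math.log(p))

def simulate_extreme(rng, length=1000):
    # Hypothetical stand-in for one simulation: the peak magnitude of a
    # signal driven by random (here Gaussian) input values.
    return max(abs(rng.gauss(0.0, 1.0)) for _ in range(length))

rng = random.Random(0)
extremes = [simulate_extreme(rng) for _ in range(50)]   # n = 50 simulations
location, scale = fit_gumbel(extremes)
print(gumbel_quantile(location, scale, 0.999))           # range covering 99.9% of extremes
```

Raising the probability passed to `gumbel_quantile` widens the predicted range, which corresponds to accepting a smaller probability of overflow at the cost of more integer bits.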
Range propagation. In the range propagation technique, the range of values that the
inputs of an algorithm may take must be specified by the user. Interval arithmetic is
then used to propagate ranges from the inputs of each computation to that computation’s
output, without the need for simulating the algorithm. Propagation of ranges is
done through a simple technique where the worst case range possible at the output of an
arithmetic operation is calculated, given the range of the inputs to the component and the
component’s function. A set of rules exist that determine how this propagation should be
performed for each type of arithmetic operation. For example, given an adder with ranges
[−1, 3] and [−4, 3] for its two inputs, where in each case the first number shown in square
brackets is the lower extent of the input range and the second number shown is the upper
extent of the range, the worst case output range would be [−1− 4, 3 + 3], i.e. [−5, 6].
Ranges are propagated forwards through an algorithm from its inputs to its output.
This allows worst case estimation of the range of values that each signal in the algorithm
may take, and hence how many bits are needed so as to avoid overflow. This simple
technique is used in [BP00; WP98].
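The propagation rules described above can be sketched with a small interval type. The class name `Interval` and its layout are illustrative assumptions and are not taken from [BP00; WP98]; only the standard worst-case rules for addition, subtraction and multiplication are shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Worst case: smallest plus smallest, largest plus largest.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The extremes lie among the four products of the end points.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

a = Interval(-1, 3)
b = Interval(-4, 3)
print(a + b)  # Interval(lo=-5, hi=6), the adder example above
print(a * b)  # Interval(lo=-12, hi=9)
```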
The range propagation technique has several problems. Firstly if the inputs to an
operation are correlated with each other, range propagation is unable to account for this
and can over-estimate the range of the output of the operation as a result. Secondly if
an algorithm contains feedback loops (such as those present in an IIR filter) then range
propagation is inadequate as ranges are repeatedly propagated around the feedback loop
which will result in non-convergence in the range of variables used in that loop.
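Both problems can be reproduced in a few lines. In the sketch below, intervals are plain (lower, upper) tuples and the second-order recursive filter y[n] = 1.5 y[n-1] - 0.8 y[n-2] + x[n] is a hypothetical example: its poles have magnitude sqrt(0.8) < 1, so the filter itself is stable, yet the propagated range grows without bound.

```python
def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def scale(c, a):
    lo, hi = c * a[0], c * a[1]
    return (min(lo, hi), max(lo, hi))

# Correlation problem: x - x is always zero, but interval arithmetic treats
# the two operands as independent and reports a much wider range.
x = (-2.0, 2.0)
print(sub(x, x))  # (-4.0, 4.0)

def propagate_feedback(iterations, x=(-1.0, 1.0)):
    """Propagate ranges repeatedly around the feedback loop of the example filter.

    Returns the width of the range of y after each pass around the loop.
    """
    y1 = y2 = (0.0, 0.0)  # ranges of y[n-1] and y[n-2]
    widths = []
    for _ in range(iterations):
        y = add(sub(scale(1.5, y1), scale(0.8, y2)), x)
        y1, y2 = y, y1
        widths.append(y[1] - y[0])
    return widths

widths = propagate_feedback(20)
print(widths[:3], widths[-1])  # the width keeps growing with every pass
```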
In [SBA00; NHCB01] range propagation is extended by proposing that the range
of variables can be propagated backward through an algorithm, i.e. from outputs towards
inputs, as well as forwards. This is possible where an operation or statement at some
point in an algorithm implies or restricts a variable to lie within a particular range, such
as when:
• using a variable as an index to an array of known bounds, or
• using a conditional statement which limits a variable’s value if it is above some
threshold.
When such situations arise in an algorithm the restricted range of the variable can be
propagated backwards through the algorithm to previous operations where the variable
was used, up to the operation where the variable was created. However this technique
relies on restricted ranges implied by the user in the cases outlined above, and so cannot
be used to automatically improve the over-estimated ranges caused by the correlation
problem, or the problem of ranges exploding around a feedback loop.
In [CRS+99] it is proposed that range propagation be combined with simulation of
an algorithm. The range of signals in the simulated algorithm is monitored and, following
simulation, the range of each signal is compared to the one obtained from range prop-
agation. Identical ranges for a particular signal indicate the two techniques match and
that the calculated range should be used for that variable. Where the range from range
propagation is larger than the one from simulation, this may be because either the input
vector used for simulation is not ‘good’ enough to stimulate the entire range of values that
the signal may take in practice, or because range propagation has been pessimistic in its
prediction because of the correlation problem or the feedback loop problem.
When range propagation only predicts a slightly larger range than simulation, it is
suggested that the choice of action be left to the user; however when range propagation
predicts a much larger range, it is suggested that the range measured in simulation is used
along with saturation arithmetic in the hardware design (which uses extra logic to avoid
the ‘wrap-around’ from positive to negative values, and vice versa, that normally occurs
in overflow errors), to mitigate the high error of overflow.
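The difference between wrap-around and saturation on overflow can be illustrated with a minimal Python sketch. The function names `wrap` and `saturate` and the 8-bit word-length are illustrative assumptions, not part of [CRS+99].

```python
def wrap(value: int, bits: int) -> int:
    """Two's complement wrap-around of an integer to `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    # Reinterpret the top bit as the sign bit.
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def saturate(value: int, bits: int) -> int:
    """Clamp an integer to the representable two's complement range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


total = 100 + 60            # exceeds the 8-bit signed maximum of 127
print(wrap(total, 8))       # -96: a large positive result becomes negative
print(saturate(total, 8))   # 127: the error is limited to 33
```

With wrap-around the overflow error here is 256, whereas saturation limits it to the distance between the true result and the nearest representable value.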
Affine arithmetic. Affine arithmetic is a technique proposed in [ACS94] and used in
[LGC+06] in order to estimate the range of values taken by the output of operations
in an algorithm without simulation. The technique can be seen as an improvement to
interval arithmetic in that correlation between the internal variables of an algorithm can
be accommodated to some extent. In affine arithmetic the range of a variable x is expressed
in an affine (i.e. first-degree polynomial) form, as in (2.7).
\[ \mathrm{range}(x) = x_0 + x_1\varepsilon_1 + x_2\varepsilon_2 + \dots + x_n\varepsilon_n \qquad (2.7) \]
In (2.7), the $x_i$ are known coefficients, whilst the $\varepsilon_i$ are symbolic variables, each of which
represents an independent source of error or uncertainty. Each $\varepsilon_i$ has an unknown value in
the interval $[-1, +1]$.
As in interval arithmetic, ranges in affine arithmetic can be propagated through
the operations of an algorithm using a set of rules. For example, consider the subtraction
of two variables, $a - b$, whose ranges are stated below:
\[ \mathrm{range}(a) = 1 + 2\varepsilon_1 + \varepsilon_2 \quad \text{i.e. } [-2, 4] \qquad (2.8) \]
\[ \mathrm{range}(b) = 3 + 1\varepsilon_1 \quad \text{i.e. } [2, 4] \qquad (2.9) \]
The resulting range is given by:
\[ \mathrm{range}(a - b) = -2 + \varepsilon_1 + \varepsilon_2 \qquad (2.10) \]
Notice that because the symbolic error variable $\varepsilon_1$ is present in both variables $a$ and
$b$ (because the variables are correlated), $\varepsilon_1$ cancels out to some extent in the resulting
range. The resulting range according to affine arithmetic is $[-4, 0]$, whilst according to
interval arithmetic it is $[-6, 2]$. Affine arithmetic can account for the correlation between $a$
and $b$ and so provides tighter bounds on the range of $a - b$ in this case.
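The subtraction example can be checked with a short Python sketch that computes both bounds directly from the affine forms in (2.8) and (2.9). The `Affine` class and the `interval_sub` helper are hypothetical names introduced only for this illustration, and only subtraction is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Affine:
    """Affine form: centre + sum of coefficient * epsilon_i, epsilon_i in [-1, 1]."""
    centre: float
    terms: dict = field(default_factory=dict)  # error-symbol index -> coefficient

    def __sub__(self, other):
        symbols = set(self.terms) | set(other.terms)
        # Coefficients of shared error symbols are subtracted, so they can cancel.
        terms = {s: self.terms.get(s, 0.0) - other.terms.get(s, 0.0) for s in symbols}
        return Affine(self.centre - other.centre, terms)

    def interval(self):
        radius = sum(abs(c) for c in self.terms.values())
        return (self.centre - radius, self.centre + radius)


def interval_sub(a, b):
    """Interval arithmetic subtraction, which ignores any correlation."""
    return (a[0] - b[1], a[1] - b[0])


a = Affine(1.0, {1: 2.0, 2: 1.0})   # 1 + 2*eps1 + eps2, interval [-2, 4]
b = Affine(3.0, {1: 1.0})           # 3 + eps1, interval [2, 4]

print((a - b).interval())                        # (-4.0, 0.0)
print(interval_sub(a.interval(), b.interval()))  # (-6.0, 2.0)
```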
Affine arithmetic can exactly calculate the output range of affine operations such
as addition, subtraction and multiplication by a constant. However, operations that are
not affine (multiplication of two variables, for example) result in ranges that contain cross
terms between symbolic error variables (e.g. $\varepsilon_1\varepsilon_2$ would occur for $a \times b$) that are not
affine. For such operations it is necessary to approximate the cross terms of symbolic
error variables with new error variables that are affine and that also account for the error
introduced by the affine approximation. Inevitably, these approximations lead to a reduc-
tion in the accuracy computed for ranges of non-affine operations, reducing the advantage
of this (more computationally complex) technique over simple interval arithmetic. Also
as with interval arithmetic, data feedback loops cause problems as the ranges of variables
within the loops can grow larger and larger on each pass over the loop.
l1 scaling. The l1 scaling, originally described in [Jac70], is a range analysis technique
for digital signal processing algorithms that is used in [CCL01; CT06]. Where a transfer
function exists between the inputs and each operation within the algorithm, i.e. as is
the case for Linear Time Invariant (LTI) algorithms like Finite Impulse Response (FIR)
and Infinite Impulse Response (IIR) filters, the l1 scaling can be calculated and used to
obtain the maximum value that can occur at the output of each operation within the
LTI algorithm, given the maximum possible value at the algorithm’s input. Correlation
between the signals within an algorithm is accounted for implicitly in the transfer functions
between the algorithm inputs and the output of each operation in the system.
The method used can be summarised as follows. The output $y_k[n]$ at time $n$ of the
operation $k$ of an LTI algorithm can be calculated by the convolution of the algorithm’s
input signal $x$ with the impulse response $h_k$ of the transfer function between the algorithm’s
input and the output of operation $k$:
\[ y_k[n] = (x * h_k)[n] = \sum_{i=-\infty}^{\infty} x[i] \cdot h_k[n-i] \qquad (2.11) \]
If it is known that the input $x$ lies in the range $-M$ to $+M$, we can formulate a
test input signal $x_t$ that maximises the result of the convolution in (2.11), and hence the
largest absolute value that may occur at the output of operation $k$ can be determined.
The input signal $x_t$ can be calculated by setting its values $x_t[i]$ as in (2.12) (which reduces
to (2.13)) to maximise the convolution above, for the output of a particular operation $k$
at time $n$.
\[ \begin{aligned} &\text{If } h_k[n-i] > 0, \text{ set } x_t[i] = +M \\ &\text{If } h_k[n-i] < 0, \text{ set } x_t[i] = -M \end{aligned} \qquad (2.12) \]
\[ \max(y_k) = M \cdot \sum_{i=-\infty}^{\infty} |h_k[i]| \qquad (2.13) \]
We can thus analytically obtain a worst case range for each signal in an LTI algo-
rithm, even in the presence of correlation and feedback loops, as these are accounted for
implicitly by the transfer functions between the inputs and outputs of each operation of
an LTI algorithm. Clearly though this method cannot be applied to algorithms that are
not Linear Time-Invariant.
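As a minimal sketch of (2.12) and (2.13), the Python fragment below computes the l1 bound for a short FIR impulse response and confirms that the worst-case test input attains it. The function names and coefficient values are assumptions chosen for illustration. The second example truncates the impulse response of a first-order recursive filter to 60 samples, which is an approximation of the infinite sum.

```python
def l1_bound(h, M):
    """Worst-case output magnitude from (2.13) for impulse response h."""
    return M * sum(abs(c) for c in h)


def worst_case_input(h, M):
    """Test input x_t from (2.12) for output time n = len(h) - 1.

    Where h is exactly zero the choice of sign does not matter; +M is used.
    """
    n = len(h) - 1
    return [M if h[n - i] >= 0 else -M for i in range(len(h))]


def convolve_at(x, h, n):
    """Single output sample of the convolution in (2.11)."""
    return sum(x[i] * h[n - i] for i in range(len(x)) if 0 <= n - i < len(h))


h_fir = [0.5, -0.25, 0.125]
M = 1.0
x_t = worst_case_input(h_fir, M)
print(l1_bound(h_fir, M), convolve_at(x_t, h_fir, len(h_fir) - 1))  # 0.875 0.875

# Feedback example: y[n] = x[n] + 0.5 * y[n-1] has impulse response 0.5**i.
h_iir = [0.5 ** i for i in range(60)]
print(l1_bound(h_iir, M))   # very close to the exact value 1 / (1 - 0.5) = 2
```

The feedback loop in the second example causes no difficulty because it is captured entirely by the impulse response.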
Quantisation noise analysis
If there are not enough fraction bits in the binary representation of a signal to store
all of the fractional part of a number, quantisation noise is introduced into the signal.
Quantisation noise analysis is the process of determining how much quantisation noise
has been introduced into the output of an algorithm due to quantisations within the
algorithm itself. The methods for quantisation noise analysis listed below have
been presented in previous work. These accommodate different algorithm types and
application domains, for which different measures of algorithm accuracy are necessary.
• Simulation [SK95; CSPL01; RB05; BR05; KKS98; HE06]
• Multi-interval arithmetic [BP00]
• Affine arithmetic [LGC+06]
• L2 norm [CCL99; CT06]
• Perturbation analysis [Con03]
• Taylor-expansion based methods [WP98; GMLC04]
Simulation. The most common method for establishing the quantisation noise intro-
duced into an algorithm due to word-length selection is to simulate the algorithm using
the given set of word-lengths and compare its outputs to those achieved by performing the
algorithm under infinite precision. Infinite precision cannot be achieved but instead the
use of some arbitrarily high precision such as double precision floating point for all signals
in an algorithm is normally sufficient.
Given both the output signal from a quantised version of an algorithm and the ‘in-
finite’ precision version of the algorithm, it is possible to use any appropriate method for
measuring the difference (or similarity) between these two signals to determine how accu-
rate the quantised version of the algorithm is. Depending on the target application domain
different measures of accuracy may be appropriate; in DSP applications for example the
fidelity of a signal is frequently measured by the ratio of signal power to quantisation noise
power, i.e. the signal to noise ratio (SNR) of the signal. Other measures of accuracy may
be appropriate for other application domains such as communication systems or scientific
computing applications, but these can also be calculated from the outputs of a quantised
and ‘infinite’ precision system.
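A simulation-based SNR measurement of this kind can be sketched in Python as follows. The 4-tap FIR filter, the test signal, the truncation-based `quantise` function and the use of a single number of fraction bits for every internal signal are all assumptions made for brevity. The input itself is left unquantised and double precision floating point stands in for ‘infinite’ precision.

```python
import math


def quantise(value, frac_bits):
    """Truncate towards minus infinity to `frac_bits` fraction bits."""
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale


def fir(x, coeffs, frac_bits=None):
    """FIR filter; quantises coefficients, products and sums if frac_bits is set."""
    q = (lambda v: v) if frac_bits is None else (lambda v: quantise(v, frac_bits))
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc = q(acc + q(q(c) * x[n - k]))
        out.append(acc)
    return out


def snr_db(reference, quantised):
    """Ratio of signal power to quantisation noise power, in decibels."""
    signal = sum(r * r for r in reference)
    noise = sum((r - v) ** 2 for r, v in zip(reference, quantised))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


x = [math.sin(0.05 * n) + 0.5 * math.sin(0.31 * n) for n in range(2000)]
coeffs = [0.21, 0.37, 0.37, 0.21]
reference = fir(x, coeffs)                      # 'infinite' precision run
snrs = {bits: snr_db(reference, fir(x, coeffs, bits)) for bits in (6, 10, 14)}
for bits, value in snrs.items():
    print(bits, round(value, 1))
```

Any other accuracy measure could be substituted for `snr_db`, since both output sequences are available after simulation.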
Simulation has another significant advantage as a method for noise analysis: it
can be applied to any known algorithm in order to gather the outputs of both the
quantised and ‘infinite’ precision versions of the algorithm. This is in stark
contrast to other techniques that are currently known that can only estimate quantisation
noise on particular classes of algorithm, as will be seen later in this subsection.
Due to its universal applicability, simulation is used as a method for performing
quantisation noise analysis in a number of existing word-length optimisation methods
[SK95; CSPL01; RB05; BR05; KKS98; HE06]. However there is a significant drawback to
simulation that, in conjunction with the enormous search space involved in word-length
optimisation, makes its use as a method for quantisation noise analysis difficult to justify.
As was noted for the use of simulation in range analysis, in general it is not possible for
designers to conceive of every possible sequence of inputs that may drive their algorithms,
however test vectors that capture as much of the expected behaviour as possible are
essential in order to ensure accurate measurements are made for quantisation noise analysis
(or indeed for range analysis).
The computational complexity of using simulation for quantisation noise analysis
is thus high as long simulation vectors are necessary due to the above. Additionally a
quantised algorithm requires many extra operations on modern computer processors (on
which we assume algorithm simulation is performed) in order to account for the truncations
and shifts that are inherent in multiple word-length systems.
It has already been explained that the search space involved in word-length optimi-
sation problems can be huge. Both noise analysis and cost evaluation must be performed
at every point in the search space in order to find a globally optimum solution to the
word-length optimisation problem, and hence if simulation is used we expect that solving
all but the most trivial of word-length optimisation problems will not be possible. The
remainder of the noise analysis techniques introduced in this subsection concentrate on
analytical methods for estimating the quantisation noise in an algorithm without using
simulation and hence these provide more promising ways of solving larger word-length
optimisation problems.
Multi-interval analysis. Strictly speaking this method is not a quantisation noise cal-
culation technique as it only allows the calculation of the word-lengths necessary to in-
troduce no truncation noise into an algorithm. Nevertheless it has been used in previous
word-length optimisation work [BP00] and hence is relevant to this section.
Multi-interval analysis is the LSB-side equivalent to interval arithmetic, which was
used for range analysis. The aim of this method is to allow the smallest possible value
that needs to be represented in the result of each operation to be propagated through the
output of each operation of an algorithm. This is done by propagating the smallest absolute
values that can occur at each of the inputs to the algorithm through each operation in the
algorithm, in a similar way to interval arithmetic.
In order to achieve this, multi-interval analysis proposes using a union of three
intervals to represent the range of a variable, rather than the single range used in interval
arithmetic (containing a lower and upper bound only). As the technique is aimed at two’s
complement fixed point representation, the three ranges are constructed as follows: the
first contains the most negative number that can be represented, and the negative number
of the smallest magnitude which can be represented. The second interval always contains
the number zero only, and the third contains the smallest and largest positive numbers
which can be represented.
For each type of arithmetic operation there is a corresponding rule that calculates
the largest and smallest numbers that need to be represented at the output of that op-
eration in order to store the result. As in range propagation however problems occur in
feedback loops: we can expect the smallest magnitude number that needs to be repre-
sented to become smaller and smaller each time the loop is traversed if operations such as
multiplications are used within the loop.
Because this method merely calculates the smallest possible word-length that can
be used to avoid quantisation it cannot be used as part of the search of the design space of
a word-length optimisation problem. If the algorithm designer wishes to sacrifice a small
amount of accuracy in order to make further reductions in the cost of an algorithm this
technique is not appropriate.
Affine arithmetic. In the work presented in [LGC+06] the authors use affine arithmetic
to determine the effects of quantisation errors propagated to the output of an algorithm.
The worst-case quantisation error is estimated for every quantised signal, which in turn
allows the worst case error at the outputs of a quantised algorithm to be estimated. The
worst-case error is an error metric that is useful for scientific computing applications where
algorithm designers wish to ensure results have a guaranteed level of accuracy: if the
worst-case error in a signal is $2^{-3}$, for example, then the two’s complement bits that represent
values larger than this worst-case error value will always be correct.
Affine arithmetic is used to propagate worst-case quantisation errors forwards from
the inputs of an algorithm to its outputs, in a similar way to how affine arithmetic is
used for range analysis. This is summarised as follows. The two’s complement quantised
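version $\tilde{x}$ of a signal $x$ is represented in affine form as:
\[ \tilde{x} = x + 2^{-f_{\tilde{x}}-1}\varepsilon \qquad (2.14) \]
where $f_{\tilde{x}}$ is the number of fraction bits used to represent $\tilde{x}$ and $\varepsilon$ is a symbolic error
variable of unknown value in the range $[-1, 1]$. Thus the error $E_{\tilde{x}}$ in $\tilde{x}$ due to quantisation
is given by:
\[ E_{\tilde{x}} = 2^{-f_{\tilde{x}}-1}\varepsilon \qquad (2.15) \]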
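There exists a method for propagating quantisation errors through each type of
arithmetic operation, as shown below for the addition or multiplication of two variables $x$
and $y$:
\[ E_{\tilde{x}+\tilde{y}} = E_{\tilde{x}} + E_{\tilde{y}} + \varepsilon_{\tilde{x}+\tilde{y}}\,2^{-f_{\tilde{x}+\tilde{y}}-1} \qquad (2.16) \]
\[ E_{\tilde{x}\times\tilde{y}} = yE_{\tilde{x}} + xE_{\tilde{y}} + E_{\tilde{x}}E_{\tilde{y}} + \varepsilon_{\tilde{x}\times\tilde{y}}\,2^{-f_{\tilde{x}\times\tilde{y}}-1} \qquad (2.17) \]
where $f_{\tilde{x}+\tilde{y}}$ and $f_{\tilde{x}\times\tilde{y}}$ represent the number of fraction bits used to represent the output of
$\tilde{x} + \tilde{y}$ and $\tilde{x} \times \tilde{y}$, respectively, and $\varepsilon_{\tilde{x}+\tilde{y}}$ and $\varepsilon_{\tilde{x}\times\tilde{y}}$ are symbolic error variables that represent
the uncertainty due to the quantisation that occurs at the output of $\tilde{x} + \tilde{y}$ and $\tilde{x} \times \tilde{y}$,
respectively.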
Note that the calculation of $E_{\tilde{x}\times\tilde{y}}$ requires the products $yE_{\tilde{x}}$ and $xE_{\tilde{y}}$ to be
calculated. In order for the worst case error at the output of the quantised algorithm to
be calculated, the maximum values of $x$ and $y$ should be used in these products in order
to obtain the maximum possible value of $E_{\tilde{x}\times\tilde{y}}$. Note that the maximum and minimum
values of all signals within an algorithm are already known from range analysis by the
time it is necessary to perform quantisation noise analysis, so the maximum values of x
and y would be easily available for quantisation noise calculations.
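The propagation rules (2.16) and (2.17) can be sketched in Python under a simplifying assumption: each symbolic error variable is replaced by its worst-case magnitude of 1, so the functions below return upper bounds on the error magnitude and do not track cancellation between correlated error terms. The function names and the example word-lengths are hypothetical.

```python
def quantisation_error(frac_bits):
    """Worst-case magnitude of (2.15) for a signal with `frac_bits` fraction bits."""
    return 2.0 ** (-frac_bits - 1)


def add_error(err_x, err_y, out_frac_bits):
    """Worst-case bound on (2.16): input errors plus output quantisation."""
    return err_x + err_y + quantisation_error(out_frac_bits)


def mul_error(err_x, err_y, max_abs_x, max_abs_y, out_frac_bits):
    """Worst-case bound on (2.17), using signal maxima from range analysis."""
    return (max_abs_y * err_x + max_abs_x * err_y + err_x * err_y
            + quantisation_error(out_frac_bits))


e_x = quantisation_error(8)      # 2**-9
e_y = quantisation_error(8)      # 2**-9
e_sum = add_error(e_x, e_y, 8)   # 3 * 2**-9
e_prod = mul_error(e_x, e_y, max_abs_x=2.0, max_abs_y=1.0, out_frac_bits=8)
print(e_sum, e_prod)
```

Chaining these functions from inputs to outputs gives the worst-case output error as a sum of contributions from each quantised signal.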
This technique allows the worst case error at the outputs of an algorithm to be
expressed in an analytical form, i.e. as the sum of quantisation error contributions from
the quantisation of each signal within the algorithm. Clearly the calculation of the worst
case quantisation error in an algorithm will be much less computationally expensive using
this technique than through the use of simulation. However this method cannot be used
on algorithms that contain feedback loops as it propagates quantisation errors strictly for-
wards. This technique would repeatedly accumulate quantisation errors around feedback
loops, as with other interval analysis techniques.
L2 norm. The L2 norm is a method for estimating the variance of the noise at the
output of a DSP algorithm that is due to quantisation within the algorithm and is used in
[CCL01; CT06]. The method was originally described in [Jac70], where it is shown that
the variance $\sigma_y^2$ of the noise at the output $y$ of a DSP algorithm due to quantisation within
the algorithm is given by:
\[ \sigma_y^2 = \sum_{i \in I} \lambda_{iy}\sigma_i^2 \qquad (2.18) \]
where $I$ is the set of all quantised signals within the algorithm, $\lambda_{iy}$ is a scaling factor that
is determined from the transfer function between signal $i$ and output $y$, and $\sigma_i^2$ is the
variance of the noise injected into the algorithm due to quantisation of signal $i$.
The variance $\sigma_i^2$ of the noise injected into a two’s complement representation of a
signal $i$ with scaling $p = 0$ due to quantisation is typically estimated as:
\[ \sigma_i^2 = \frac{1}{12}\,2^{-2n} \qquad (2.19) \]
where $n$ is the word-length of a signal, by assuming the injected noise is uniformly
distributed, white and uncorrelated [OS75]. This estimate of the variance of two’s
complement quantisation noise is used in the word-length optimisation work proposed in [CT06].
However when the word-length of a signal is shortened by only one or two bits this model
is invalid as the noise introduced is discrete, and as shown in [CCL99] is better modelled
by:
\[ \sigma_i^2 = \frac{1}{12}\,2^{2p}\left(2^{-2n_2} - 2^{-2n_1}\right) \qquad (2.20) \]
where $p$ is the scaling of the signal (determined by range analysis), and $n_1$ and $n_2$ are the
original and shortened word-lengths used, respectively.
By estimating the noise variance injected into each signal in a DSP algorithm via
(2.19) or (2.20) and then using the L2 norm (2.18) it is possible to estimate the variance
of the quantisation noise at the algorithm’s output analytically. The variance of the
quantisation noise at the DSP algorithm’s output can then be related to the output’s
signal to noise ratio (SNR), if necessary. Clearly this technique based on the L2 norm
will provide much faster estimates of algorithm accuracy than simulation, however the L2
norm can only be calculated when transfer functions exist between each quantised signal
in the algorithm and the algorithm’s output, i.e. for LTI systems only.
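A minimal Python sketch of (2.18) to (2.20) is given below. It assumes that the scaling factor for each signal is the sum of the squared samples of the impulse response from that signal to the output, which is the output variance produced by unit-variance white noise passing through an LTI path. The function names and the two-signal example are illustrative only.

```python
import math


def continuous_noise_variance(n):
    """Noise variance estimate (2.19) for word-length n and scaling p = 0."""
    return 2.0 ** (-2 * n) / 12.0


def discrete_noise_variance(n1, n2, p=0):
    """Noise variance (2.20) when shortening a word-length from n1 to n2 bits."""
    return 2.0 ** (2 * p) * (2.0 ** (-2 * n2) - 2.0 ** (-2 * n1)) / 12.0


def path_scaling(h):
    """Assumed scaling factor: squared L2 norm of the impulse response h."""
    return sum(c * c for c in h)


def output_noise_variance(scalings, variances):
    """Output noise variance (2.18): scaled sum of the injected variances."""
    return sum(l * v for l, v in zip(scalings, variances))


# Two quantised signals: one passes through h = [0.5, 0.5], one is the output itself.
scalings = [path_scaling([0.5, 0.5]), path_scaling([1.0])]
variances = [discrete_noise_variance(16, 8), discrete_noise_variance(16, 8)]
sigma_y2 = output_noise_variance(scalings, variances)
print(sigma_y2)
```

Because no simulation is involved, re-evaluating `output_noise_variance` for a new set of word-lengths costs only a handful of multiplications.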
Perturbation analysis. This technique, proposed in [Con03], uses a combination of
simulation and analytical techniques to estimate the noise introduced into the output of
a quantised non-linear DSP algorithm. The premise of this work is the assumption that
quantisation noise introduced into a system is too small to affect the macroscopic behaviour
of the system. As a result, each operation in a system can be locally linearised, i.e. replaced
by its “small scale equivalent” [SS91], and, once each operation in an algorithm has been
linearised in this way, the effects on the algorithm’s output (i.e. its “sensitivity”) due to
small perturbations introduced into each signal within the algorithm can be analysed.
Linearisation takes place via the following method. Given a differentiable $n$-input
operation $y[t] = f(x_1[t], x_2[t], \dots, x_n[t])$, where $t$ is a time index, then if small perturbations
$\Delta x_1[t], \Delta x_2[t], \dots, \Delta x_n[t]$ are introduced into each input variable of $f$, the resulting
perturbation at the output of the function can be approximated as a first-order Taylor
expansion of $f$, i.e.
\[ \Delta y[t] \approx \Delta x_1[t]\frac{\partial f}{\partial x_1} + \Delta x_2[t]\frac{\partial f}{\partial x_2} + \dots + \Delta x_n[t]\frac{\partial f}{\partial x_n} \qquad (2.21) \]
The first-order Taylor expansion of $f$ is a linear function in terms of its inputs
$x_i$, though the Taylor coefficients $\partial f / \partial x_i$ can vary with time. All differentiable non-linear
operations in a system can be replaced according to (2.21), allowing the “small-signal”
effects in a non-linear time-invariant algorithm to be modelled by a linear time-varying
algorithm.
Once all non-linear operations have been replaced according to (2.21), the sensitiv-
ity of the algorithm’s output to noise injected into each signal within the algorithm must
be measured. This is done by simulating the linearised algorithm once for every signal
in the algorithm, where for each simulation noise is injected into a specific signal in the
algorithm, and the response to this noise injection at the algorithm’s output measured.
As the “small-signal” model is linear, if the algorithm’s output exhibits a noise
variance of $V$ when a particular signal is disturbed by a noise signal of variance $\sigma^2$ then
the output will exhibit noise of variance $\alpha V$ when the same signal is disturbed by noise
of variance $\alpha\sigma^2$. Hence if the sensitivities $s_i$ of the algorithm’s output (i.e. the variances
of its response to noise of unit variance injected individually into each signal $i \in S$, where
$S$ is the set of signals in the algorithm) are measured, then the output noise variance $\sigma_y^2$ in
response to noise of variance $\sigma_i^2$ injected into each signal $i \in S$ is given by:
\[ \sigma_y^2 = \sum_{i \in S} s_i\sigma_i^2 \qquad (2.22) \]
The sensitivities si of the algorithm’s output in response to noise injected into each
signal i ∈ S can be determined once only by simulating the algorithm |S| times. After
this has been achieved the output noise variance can be estimated using the sensitivities
si along with (2.22) and the estimated noise variances of each quantised signal in the
algorithm, which can be determined from an assigned set of word-lengths according to
the quantisation noise model in [CCL99], described in the subsection on the L2-norm
above and expressed in (2.20). Although the initial cost of calculating the sensitivities of
signals is high, this needs to be done once only, after which quantisation noise variance
and hence algorithm accuracy can be determined analytically for different word-lengths
at low computational cost.
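The two-phase structure of this approach (measure sensitivities once by simulation, then evaluate (2.22) cheaply for many word-length assignments) can be sketched in Python. This is a heavily simplified illustration, not the method of [Con03]: it assumes a single memoryless operation $y = x_1 x_2$, test noise of ±1 (zero mean, unit variance), an original word-length of 24 bits, and hypothetical function names.

```python
import math
import random


def measure_sensitivities(samples, gradients, seed=0):
    """Simulate the linearised model once per signal and return each s_i."""
    rng = random.Random(seed)
    sensitivities = []
    for grad in gradients:                         # one simulation run per signal
        power = 0.0
        for sample in samples:
            delta = rng.choice((-1.0, 1.0))        # zero-mean, unit-variance noise
            power += (grad(*sample) * delta) ** 2  # output perturbation, as in (2.21)
        sensitivities.append(power / len(samples))
    return sensitivities


def injected_variance(n1, n2, p=0):
    """Quantisation noise variance (2.20) for shortening n1 bits to n2 bits."""
    return 2.0 ** (2 * p) * (2.0 ** (-2 * n2) - 2.0 ** (-2 * n1)) / 12.0


def output_noise_variance(sensitivities, variances):
    """Output noise variance (2.22)."""
    return sum(s * v for s, v in zip(sensitivities, variances))


samples = [(math.sin(0.1 * t), 2.0 + math.cos(0.03 * t)) for t in range(5000)]
gradients = [lambda x1, x2: x2,    # partial derivative of x1 * x2 w.r.t. x1
             lambda x1, x2: x1]    # partial derivative of x1 * x2 w.r.t. x2
s = measure_sensitivities(samples, gradients)      # expensive step, done once

noise = {}
for wordlengths in [(12, 12), (8, 12), (12, 8)]:   # cheap step, repeated often
    variances = [injected_variance(24, n) for n in wordlengths]
    noise[wordlengths] = output_noise_variance(s, variances)
    print(wordlengths, noise[wordlengths])
```

In this example the output is more sensitive to noise on $x_1$ than on $x_2$, so shortening the first signal degrades accuracy more than shortening the second.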
Other Taylor expansion methods. In [GMLC04] a technique called automatic dif-
ferentiation is used to establish the sensitivities of the output of a non-linear algorithm
to quantisation noise in the signals of the algorithm. Once signal sensitivities are known,
the maximum output error can be calculated given a set of word-lengths to use for the
system. The automatic differentiation technique uses known partial derivatives of ele-
mentary arithmetic operations to annotate the operations within an algorithm with their
partial derivatives with respect to their inputs. First-order Taylor approximations of the
sensitivities of the output of an algorithm to internal quantisation can thus be established
and are used to determine algorithm accuracy by estimating the maximum possible output
error due to quantisation. The work in [WP98] also derives a first-order Taylor approxi-
mation of the operations within an algorithm and uses forward-propagation to establish
the maximum error at the algorithm’s output as a result.
These methods both benefit from low computational cost in estimating algorithm
accuracy, but cannot be applied to algorithms that contain feedback
loops and additionally cannot make use of time-varying information obtained through
simulation, as is done in, for example, the perturbation analysis method.
Summary
From the literature review conducted above it should be clear that establishing the
accuracy of quantised algorithms is a complex problem, for which many solutions
targeting different algorithm types and application domains have been proposed in
existing work. The following subsection is concerned with the calculation of the cost of
the selection of word-lengths in an algorithm. The analysis of the accuracy of quantised
algorithms will later be revisited in Section 2.1.5 where a summary of existing word-length
optimisation work is tabulated for ease of comparison.
2.1.3 Calculation of cost
Word-length optimisation has mainly been applied to the minimisation of the silicon area
consumed by algorithm implementations. Although its effects on the power consumption
and computational throughput of these implementations have been noted [Con03; SBA00;
LGC+06], with particularly important improvements seen in power consumption, no efforts
have been made to target these directly.
The main reason for the bias towards word-length optimisation techniques for area
minimisation is that the input word-lengths of arithmetic operators have clear and well
studied links to the silicon area they consume. In contrast the power consumption and
throughput of an algorithm are related in more complex ways to signal word-lengths.
The power consumption or throughput of an implementation of an
algorithm can typically only be established through the elaboration and analysis of
the implementation, which is normally highly computationally expensive and hence would
allow only the smallest of these types of word-length optimisation problems to be solved.
Typical measures of the area cost of quantised versions of an algorithm used in
existing word-length optimisation work are: i) the sum of the word-lengths of all the signals
in an algorithm, or ii) some estimate of the area occupied by the quantised algorithm, using
knowledge available on the logic required to implement arithmetic operations of specific
word-lengths, as is done in [WP98; KKS98; SBA00; Con03; GMLC04; LGC+06]. Both of
these measures of the area consumed by an implementation of a quantised algorithm have
a low computational cost.
Due to the relative ease with which the number of logic resources required by
arithmetic operations can be estimated, specific techniques for doing so have not been
the focus of any existing word-length optimisation work. Nevertheless, as estimating the
area consumed by an implementation of an algorithm is relevant to the work in this
thesis, Section 3.2 summarises simple methods for doing so when the target computational
platform is a reconfigurable device.
As will be seen however, estimating power consumption quickly is a much more
difficult task than estimating the area consumed by an implementation of an algorithm.
It is the aim of this thesis to present the first known multiple word-length optimisation
work on power consumption minimisation, and as power consumption estimation has not
been applied to this problem before, existing work in that area is dealt with separately
in Section 2.3.2. The following subsection deals with the cause of the necessity for fast
accuracy and cost estimation during word-length optimisation: the complex problem of
word-length selection.
2.1.4 Word-length selection
The preceding sections have highlighted relevant work on estimating the accuracy and
cost of implementations of algorithms, given a particular set of word-lengths. A significant
amount of work on word-length optimisation has been dedicated to these problems because
of the necessity for fast methods of obtaining these estimates.
Despite the availability of fast methods for estimating the accuracy and area of im-
plementations of an algorithm, the underlying complexity in selecting the optimal word-
lengths to minimise a given cost function remains. As was stated in Section 2.1.1, the
search space involved for non-trivial word-length optimisation problems is enormous as
every word-length that is selected during optimisation adds an extra dimension to the
problem. Due to the complexity of finding optimal solutions to word-length optimisation
problems a number of optimal and near-optimal methods have been proposed in exist-
ing work that fall under the following categories and are summarised in the following
subsections.
• Uniform word-length optimisation
• Full search/ILP formulation [CCL02]
• Application specific heuristics [KKS98; CCL01]
• Simulated annealing [LGC+06]
• Geometric programming [CT06]
Uniform word-length optimisation. Uniform word-length optimisation is a greatly
simplified version of the multi-dimensional word-length optimisation problem. Under uni-
form word-length optimisation the same word-length is used for all signals in an algorithm,
reducing the dimensionality of the problem to a single dimension. However, optimal solutions to
the uniform word-length optimisation problem are generally not very ‘good’ solutions to
the multiple word-length optimisation problem, as there tend to be multiple word-length
solutions that offer much lower cost for the same level of algorithm accuracy.
Due to the simplicity of finding the minimum uniform word-length solution
to a word-length optimisation problem, however, this method has in the past been
used by algorithm designers for performing word-length optimisation. Since the advent of
techniques for automatic word-length optimisation, algorithms optimised using uniform
word-length optimisation have been frequently used as the baseline against which multiple
word-length optimisation is compared. Work on automatic word-length optimisation indi-
cates reductions in the silicon area consumed by an algorithm of up to 80% can be achieved
by using multiple word-length optimisation instead of uniform word-length optimisation
[Con03].
The minimum uniform word-length that satisfies a given accuracy constraint can
be found in $O(\log_2 wl_{max})$ tests of the search space by using the binary search algorithm
[Knu97b], where $wl_{max}$ is a large uniform word-length that is known to satisfy the given
accuracy constraint. Clearly optimal solutions to uniform word-length problems are easily
found, but the improvements in the cost of algorithm implementations offered when using
multiple word-lengths make them much more attractive, despite the simplicity of uniform
word-length optimisation.
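The binary search for the minimum uniform word-length can be sketched in Python as follows. The sketch assumes that accuracy is monotonic in word-length (if a word-length satisfies the constraint, so does every larger one). The accuracy test shown, based on (2.19) with five signals of unit scaling and a noise budget of $10^{-6}$, is a hypothetical stand-in for whichever noise analysis method is in use.

```python
def min_uniform_wordlength(meets_accuracy, wl_max):
    """Smallest w in [1, wl_max] for which meets_accuracy(w) holds.

    Assumes meets_accuracy(wl_max) is True and that accuracy is monotonic.
    """
    lo, hi = 1, wl_max
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_accuracy(mid):
            hi = mid          # mid is feasible: the answer is mid or smaller
        else:
            lo = mid + 1      # mid is infeasible: the answer is larger
    return lo


tested = []                   # records every point of the search space tested


def meets_accuracy(w, signals=5, budget=1e-6):
    tested.append(w)
    return signals * 2.0 ** (-2 * w) / 12.0 <= budget


best = min_uniform_wordlength(meets_accuracy, wl_max=64)
print(best, len(tested))      # 10 6
```

Six tests suffice here because $\log_2 64 = 6$, compared with up to 64 tests for a linear scan.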
Full search. As explained in Section 2.1.1, optimal solutions to the multiple word-length
optimisation problem can only be guaranteed to be found by testing all $(wl_{max})^n$ points
in the search space. In [CCL02] optimal word-length selection is formalised as a Mixed
Integer Linear Program (MILP), though these optimisation problems are also NP-hard.
Although the MILP formulation can take advantage of Branch and Bound techniques so
that testing all $(wl_{max})^n$ points in the search space is not necessary, the number of tests
required is still exponential in $n$. Hence in practice heuristic methods are necessary to
find near optimal solutions to word-length optimisation problems.
Application specific heuristics. A variety of heuristics have been proposed in existing
work that are generally able (but not guaranteed) to find near-optimal solutions to word-
length optimisation problems. These heuristics can be summarised as follows.
The ‘exhaustive search’ method proposed in [KKS98] starts from a point where
accuracy constraints are not met because the word-lengths used are too small. This
starting point is found by finding, for each signal, the smallest word-length that can be
assigned to the signal that does not cause a lower level of accuracy than that required,
whilst all other signals use ‘infinite’ precision. Once this starting point is identified, word-
lengths are increased iteratively until a feasible solution is found. At each iteration the
word-length of the signal that will give the smallest increase in system cost is increased.
The ‘greedy heuristic’ method from [CCL01] starts from the point where all signals
use the minimum uniform word-length, and gradually decreases word-lengths until
no more word-lengths can be decreased without violating the given level of algorithm
accuracy. At each iteration, the word-length that would return the greatest reduction in
cost if reduced to the point where accuracy constraints were broken is reduced by one bit.
The ‘greedy’ nature of both of these heuristics, i.e. the fact that they choose
to gradually increase or decrease the word-length of the least costly or most costly signal,
respectively, means they are likely to get caught in local optima, as they will not make
moves that increase the cost of a system. However they appear to perform well in practice:
the ‘greedy heuristic’ method from [CCL01] has been shown to lie within 0.7% of
optimal solutions on average, for a set of small benchmark systems, whilst it is able to
quickly converge to solutions for larger systems [CCL02].
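A simplified greedy descent in the spirit of these heuristics can be sketched in Python. It is not the exact procedure of [CCL01]: at each iteration it simply reduces by one bit the signal whose reduction saves the most cost while keeping the accuracy constraint satisfied. The noise model (2.19) with per-signal scaling factors, the linear cost weights and all numerical values are assumptions for illustration.

```python
def output_noise(wordlengths, gains):
    """Output noise variance using (2.18) with the estimate (2.19)."""
    return sum(g * 2.0 ** (-2 * w) / 12.0 for w, g in zip(wordlengths, gains))


def cost(wordlengths, weights):
    """Hypothetical linear cost: weighted sum of word-lengths."""
    return sum(c * w for c, w in zip(weights, wordlengths))


def greedy_decrease(start, gains, weights, noise_budget):
    """Repeatedly remove one bit from the most cost-saving feasible signal."""
    wl = list(start)
    while True:
        candidates = []
        for i in range(len(wl)):
            if wl[i] > 1:
                trial = wl[:i] + [wl[i] - 1] + wl[i + 1:]
                if output_noise(trial, gains) <= noise_budget:
                    candidates.append((weights[i], i))   # saving from one bit
        if not candidates:
            return wl        # no bit can be removed without breaking accuracy
        _, best = max(candidates)
        wl[best] -= 1


gains = [4.0, 1.0, 0.25]
weights = [1.0, 2.0, 3.0]
budget = 1e-6
start = [10, 10, 10]         # minimum uniform word-length for this budget
result = greedy_decrease(start, gains, weights, budget)
print(result, cost(result, weights), cost(start, weights))
```

The descent never accepts a move that raises cost, which is exactly why it can become trapped in a local optimum.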
Simulated annealing. Simulated annealing is a stochastic optimisation technique for
finding a good approximation to the global solution to an optimisation problem and has
been used in [LGC+06] for finding good solutions to word-length optimisation problems.
Simulated annealing is an iterative algorithm that tests a random set of points around the
current search point, and may move to one of these points at the next iteration. If
the cost of the new point is lower than that of the current point the move is always made;
if it is greater, however, the algorithm moves to the new point only with a certain
probability. This probability of
moving to a worse solution decreases with the number of iterations of the algorithm so
that eventually the algorithm only moves to better solutions and converges to a final result.
Simulated annealing has an advantage over the greedy heuristics described above
as it is possible for it to leave local optima by accepting worse solutions. In [LGC+06]
the authors provide results indicating that, using simulated annealing, they are able to
find solutions that are within 1% of optimal for small systems in a fraction of the time
it takes to find an optimal solution.
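The acceptance rule described above can be sketched in a few lines. This is an illustrative sketch only, not the method of [LGC+06]: it tests a single random neighbour per iteration, uses a linear cooling schedule, and relies on hypothetical toy models (`toy_cost`, `toy_feasible`) and parameter values chosen purely for the example.

```python
import math
import random


def simulated_annealing(start, cost, feasible, steps=2000, t0=5.0, seed=0):
    """Return the best feasible word-length assignment visited."""
    rng = random.Random(seed)
    current = dict(start)
    best = dict(start)
    signals = sorted(current)
    for i in range(steps):
        temperature = t0 * (1.0 - i / steps)  # linear cooling (assumption)
        candidate = dict(current)
        candidate[rng.choice(signals)] += rng.choice((-1, 1))
        if min(candidate.values()) < 1 or not feasible(candidate):
            continue  # reject invalid or infeasible neighbours
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse points with a probability
        # that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = dict(current)
    return best


# Toy models (assumptions), chosen only to make the sketch runnable.
def toy_cost(w):
    return 3.0 * w["a"] + 1.0 * w["b"]


def toy_feasible(w):
    noise = (2.0 ** (-2 * w["a"]) + 4.0 * 2.0 ** (-2 * w["b"])) / 12.0
    return noise <= 1e-4


start = {"a": 16, "b": 16}
best = simulated_annealing(start, toy_cost, toy_feasible)
```

Because the best point visited is recorded separately from the current point, accepting a worse move never loses a good solution that has already been found.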
Geometric programming. The word-length selection method in [CT06] finds the optimum
solution to a relaxed version of the word-length optimisation problem where word-length
variables are not required to be integers. Due to this relaxation, geometric programming
can be used to find the optimal solution to the non-integer problem because both the
constraint and cost functions used are convex, i.e. any minimum found is the
global minimum of the problem. Convex optimisation techniques can be used to quickly
solve geometric programming problems by using, for example, the gradients of cost with
respect to word-length at the current solution to point the algorithm closer to the global
minimum [BV04].
Geometric programming can only be applied to solve problems where the objective
and constraint functions are posynomial, i.e. they must be of the form:
$$p(x) = \sum_{k=1}^{K} c_k \, x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_N^{a_{Nk}} \qquad (2.23)$$
where $K$, $c_k$ and $a_{ik}$ are constants, and $c_k > 0$ for $k = 1, \ldots, K$. The restriction of objective
and constraint functions to posynomial form means that geometric programming cannot
be used to model certain word-length optimisation problems (for example, the power
consumption models for embedded multipliers and for routing power described in this
thesis cannot be reduced to posynomial form).
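To make the form of (2.23) concrete, the following sketch evaluates a posynomial and rejects coefficients that violate the positivity requirement. The function name, argument layout and the example polynomial are hypothetical and are used only for illustration.

```python
def posynomial(coeffs, exponents, x):
    """Evaluate sum_k c_k * prod_i x_i ** a_ik for strictly positive x."""
    if any(c <= 0 for c in coeffs):
        raise ValueError("posynomial coefficients must be positive")
    if any(xi <= 0 for xi in x):
        raise ValueError("variables must be positive")
    total = 0.0
    for c, exps in zip(coeffs, exponents):
        term = c
        for xi, a in zip(x, exps):
            term *= xi ** a
        total += term
    return total


# Hypothetical example: p(x1, x2) = 2*x1*x2 + 3/x1 evaluated at (2, 5).
value = posynomial([2.0, 3.0], [(1, 1), (-1, 0)], (2.0, 5.0))
```

Exponents may be negative or fractional, but a function such as x1 - x2 has a negative coefficient and therefore falls outside this form.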
Additionally the non-integer solution to the geometric programming formulation of
word-length optimisation in [CT06] is not a valid solution to the word-length optimisation
problem, as all word-lengths must be integer values. The authors propose rounding up the
geometric programming solution to the nearest integer to obtain valid word-lengths, but
this solution is not necessarily the optimal integer solution: the rounded-up solution has
a higher accuracy than required by the given accuracy constraints and so improvements
to this solution are possible.
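The effect of rounding up can be illustrated with a small sketch. The relaxed word-lengths and the noise model below are hypothetical and do not come from [CT06]; they serve only to show that the rounded solution is more accurate than the relaxed one, leaving slack in the accuracy constraint.

```python
import math


def round_up(relaxed):
    """Round each relaxed word-length up to the nearest integer."""
    return {s: math.ceil(n) for s, n in relaxed.items()}


def noise(w):
    """Toy noise model (assumption): variance 2**(-2n) / 12 per signal."""
    return sum(2.0 ** (-2 * n) / 12.0 for n in w.values())


relaxed = {"a": 7.3, "b": 9.6}  # hypothetical output of the relaxed problem
rounded = round_up(relaxed)
slack = noise(relaxed) - noise(rounded)  # accuracy gained but not required
```

A positive `slack` indicates that some word-lengths might still be reduced without violating the original constraint, which is why the rounded solution is not necessarily optimal.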
2.1.5 Summary
The preceding sections have described the complexity of the search space of word-length
optimisation problems, existing methods for exploring it, and methods for quickly
estimating the accuracy and cost of implementations of algorithms so that word-length
optimisation can be performed in feasible amounts of time. Tables 2.1, 2.2 and 2.3
summarise the main properties of the methods described to allow them to be easily
contrasted.
Table 2.1: A comparison of range analysis techniques.

| Technique | Method | Comments and drawbacks |
| --- | --- | --- |
| Simulation-based [SK95; CSPL01; RB05; BR05; GMLC04; Con03] | Simulate the algorithm, measure the range of all signals. Use guard bits if necessary. | Applicable to all algorithms. Difficulty of selecting simulation vectors that exercise the algorithm: longer vectors increase (the already significant) computational complexity. |
| Combination of simulation & statistics [KKS98; SB04; HE06; ONG04] | Measure statistical parameters of signals during simulation, use these to estimate the range of signals. Possibility of using extreme value theory to estimate ranges that will give chosen rates of overflow in [ONG04]. | Based on assumptions that chosen PDFs fit those of the signals in an algorithm. Ranges still tied to choice of simulation vectors. Computationally expensive simulation still required. |
| Range propagation [BP00; WP98; SBA00; NHCB01; CRS+99] | Propagate worst case upper and lower bounds through each operation of an algorithm. | Fast analytic technique. Restricted to feed-forward algorithms (cannot accommodate feedback). Cannot account for correlation between variables. |
| Affine arithmetic [ACS94; LGC+06] | Propagate symbolic error variables that represent input ranges through an algorithm. | Fast analytic technique. Variables can cancel out in the presence of correlation. Output range variables of non-affine operations must be approximated. Restricted to feed-forward algorithms. |
| l1 scaling [CCL01; CT06] | Use transfer functions between algorithm inputs and each internal signal to determine worst-case signal ranges. | Fast analytic technique. Implicitly handles correlation within an algorithm. Can accommodate feedback loops. Restricted to LTI algorithms. |
Table 2.2: A comparison of quantisation noise analysis techniques.

| Technique | Method | Comments and drawbacks |
| --- | --- | --- |
| Simulation-based [SK95; CSPL01; RB05; BR05; HE06] | Simulate quantised algorithm and compare output to that from infinite precision simulation. | Applicable to any algorithm and to any accuracy measurement. High computational complexity. Difficulty of selecting simulation vectors. |
| Multi-interval analysis [BP00] | Propagation of smallest representable absolute values through an algorithm via interval-based method. | Fast analytic technique. Cannot trade off accuracy to lower cost. Feed-forward algorithms only. |
| Affine arithmetic [LGC+06] | Propagate symbolic error variables that represent worst case errors through a circuit, then construct analytical form for worst case output error. | Fast analytic technique. Estimates worst case error at outputs of an algorithm. Feed-forward algorithms only. |
| L2 norm [CCL01; CT06] | Estimate quantisation noise in each signal, then use transfer functions between internal signals and algorithm outputs to calculate total quantisation noise variance at output. | Fast analytic technique. Estimates variance of quantisation noise at algorithm outputs. Applicable to systems containing feedback loops. Restricted to LTI algorithms. |
| Perturbation analysis [Con03] | Use ‘small signal’ model of a system in order to linearise components. Establish sensitivity of output to noise introduced into each signal via simulation. Use these sensitivities when estimating overall output noise variance. | Fast analytic technique. Requires one simulation of system for each signal to establish sensitivity values. Restricted to differentiable algorithms. |
| Other Taylor expansion methods [WP98; GMLC04] | Use either automatic differentiation [GMLC04] or Taylor approximation and forward-propagation [WP98] to establish worst-case output error. | Fast analytic technique. Estimate worst case output error. Applicable to differentiable systems. Restricted to feed-forward algorithms. |
Table 2.3: A comparison of word-length selection techniques.

| Technique | Method | Comments and drawbacks |
| --- | --- | --- |
| Uniform word-length optimisation | Use the same word-length for all signals. Find the optimum word-length via a binary search. | Fastest available method. Generally achieves high cost in comparison to multiple word-length solutions. |
| Full search/ILP formulation [CCL02] | Search every possible solution and choose the optimal, or use an MILP formulation. | Combinatorial optimisation problem which is NP-hard and so can only be applied to smallest of circuits. |
| Heuristics [KKS98; CCL01] | Start from either an infeasible point (due to excessive quantisation noise) and gradually reduce noise [KKS98], or from a feasible point (with excess accuracy) and gradually reduce cost [CCL01]. | May get caught in local optima, but show fast performance in practice. |
| Simulated annealing [LGC+06] | Use stochastic optimisation: randomly choose potential next steps, choose worse solutions with a decreasing level of probability. | May get caught in local optima, but can sometimes escape by accepting worse solutions. |
| Geometric programming [CT06] | Relax the restriction on integer word-lengths and use geometric program to find optimal solution of relaxed problem. Round word-lengths up to nearest integers. | Word-lengths must be integers. Rounded up word-lengths unlikely to be optimal. Some cost models cannot be formulated as a geometric program. |
Despite the significant amount of existing work on word-length optimisation, to date the
community has focussed on using the method to minimise the area of algorithm
implementations. Work in [Con03] indicates that, as a side effect, word-length optimisation
for area reduces the dynamic power consumption of algorithm implementations by up to 98%
compared to uniform word-length optimisation. Such large reductions in power consumption
are very important for modern
high performance devices and if word-length optimisation can consistently achieve such
reductions it would prove to be an essential tool for algorithm implementation. As de-
scribed in Chapter 1 the work in this thesis aims to develop a word-length optimisation
tool for power consumption optimisation, but existing methods for estimating the dynamic
power consumed by a system are not necessarily well suited to this objective. Section 2.3
summarises relevant research in the area of power consumption estimation, whilst the fol-
lowing section introduces the reconfigurable computing platform that is used to implement
high performance versions of the DSP algorithms targeted by word-length optimisation in
this work.
2.2 The reconfigurable computing platform
At this point in this chapter it is necessary to give details on the further specialisation
of the scope of the word-length optimisation problem that is approached in this thesis.
The preceding section highlighted the fundamental difficulties of word-length optimisation
problems and described existing methods for performing the tasks involved in word-length
optimisation but remained necessarily vague with regard to the computational platform
used in prior work to implement word-length optimised algorithms.
As stated in Chapter 1 it is the objective of this thesis to provide a new method to
aid the design automation of implementations of DSP algorithms on reconfigurable com-
puting devices by automatically selecting the signal word-lengths necessary to minimise
the power consumption of these implementations. DSP algorithms required in areas such
as real-time video processing or data analysis are very computationally demanding due
to the high levels of throughput and arithmetic operations required. Microprocessors are
typically not suited to such applications as it would be necessary for them to use very
high clock speeds (and hence consume large amounts of power) to achieve the necessary
throughput.
Fortunately DSP algorithms can be highly parallelised on custom ASIC or reconfig-
urable computing platforms, allowing low clock speed implementations to be used that are
still able to achieve the throughput necessary for real-time processing. Unlike their ASIC
counterparts reconfigurable devices offer post-fabrication flexibility and lower cost in small
volumes (as discussed in Chapter 1), hence they are good targets for implementations of
specialised DSP applications that require high performance, such as those targeted in this
work. As is discussed below, however, reconfigurable devices cannot offer the same
performance levels as ASICs because of the overhead of their flexibility. Their high
complexity also makes them difficult to program in comparison to microprocessors, hence
CAD tools such as word-length optimisation are essential to ensure that the computational
power available in these devices is utilised effectively with minimum designer effort.
Reconfigurable computing devices implement custom logic circuits and are thus
named because it is possible to reconfigure the device so that it implements any one
of the logic circuits that can be accommodated within the device. This
is in contrast to ASICs, which are circuits with dedicated functionality. While ASICs
offer absolute control over the number, size and organisation of the transistors used to
implement a logic circuit, reconfigurable devices consist of a fixed number of transistors
that are organised so as to implement configurable logic blocks that can be connected in
an arbitrary fashion in order to implement a logic circuit. The configuration of these logic
blocks and the connections between them are stored electronically within the device and
hence are decided after the device has been fabricated, but can be changed at will so that
the device can implement any number of logic circuits.
This flexibility comes at a cost in comparison to ASICs, however, as extra transistors
and metal wires are required to build an integrated circuit with the flexibility of
a reconfigurable device, rather than an integrated circuit that has a dedicated function.
According to the comparison made in [KR06], the cost of flexibility is approximately
40 times the silicon area, 12 times the power consumption and 3.2 times the delay of an
equivalent ASIC. Nonetheless reconfigurable devices are able to offer high-performance
parallel computing capabilities at a lower cost for low-volume applications and without
the associated risk of ASICs (where a mistake made in a mask is permanent).
Reconfigurable devices also offer advantages over computer processors for providing
algorithm implementations as they can take advantage of inherent parallelism within an
algorithm to provide higher throughput compared to single instruction per cycle proces-
sors: in [TCW+05] for example it is shown that reconfigurable computing designs can
achieve up to 500 times speedup with reductions of up to 70% in power consumption over
microprocessor implementations for specific applications.
The remainder of this section is dedicated to describing the construction of a par-
ticularly successful class of reconfigurable device called Field Programmable Gate Arrays
(FPGAs), and summarises the process of implementing an algorithm on such a device, as
this information is highly relevant for the purpose of estimating the power consumption
of algorithms implemented on the device, as is seen later in Section 2.3.
2.2.1 FPGA construction
FPGAs are reconfigurable devices that store their current configuration in volatile
registers within the device, and hence require configuration every time the device is
powered on. FPGAs are
characterised by an extremely flexible routing network that allows for a large array of
designs to be implemented using the available logic. This section describes the basic
features of a particular type of FPGA called the Virtex II Pro which is manufactured by
Xilinx [Xil04] and contains features that are very similar to those of other current high
performance FPGAs.
The Virtex II Pro is used in this thesis in order to gather real device data and
to demonstrate the techniques developed on a well studied FPGA device. It should be
noted however that the techniques developed in this thesis are also applicable to other
FPGA families including, for example, the new Virtex 4 and Virtex 5 devices that are
also manufactured by Xilinx, as has been demonstrated in work published by the author
in [CCC07] and summarised in Chapter 5.

Figure 2.3: (a) The basic logic blocks available in an FPGA, comprising a 4-input
LUT, carry logic, and a register. (b) The contents of a Virtex II Pro SLICE.

The basic building blocks that form the configurable logic of the Virtex II Pro
FPGAs are four-input look-up tables (LUTs). Inside the LUT, the output for every
possible combination of values of the four inputs is stored in configuration registers.
A tree of multiplexors is then used to select the appropriate output for the LUT,
according to the current inputs.
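The behaviour of a four-input LUT can be sketched in software as follows. This is an illustrative model only: the function names, the bit ordering of the inputs and the example parity function are assumptions made for the sketch, not details of the Virtex II Pro.

```python
def make_lut4(function):
    """Precompute the 16 configuration bits of a 4-input Boolean function."""
    return [
        int(bool(function((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)))
        for i in range(16)
    ]


def mux2(select, when_zero, when_one):
    """A 2:1 multiplexor."""
    return when_one if select else when_zero


def lut4_read(config, a, b, c, d):
    """Select one of 16 stored bits using a four-level tree of 2:1 muxes."""
    level = list(config)
    for select in (d, c, b, a):  # d is the least significant address bit
        level = [
            mux2(select, level[i], level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical example: configure the LUT as a 4-input parity function.
def parity(a, b, c, d):
    return a ^ b ^ c ^ d


config = make_lut4(parity)
output = lut4_read(config, 1, 0, 1, 1)
```

Changing the function implemented by the LUT only requires new contents for `config`; the multiplexor tree itself is unchanged, which mirrors the way the stored configuration determines the logic function.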
The output of each LUT can be connected to a register if required, via some optional
carry logic that can be used to facilitate addition. Figure 2.3(a) depicts the organisation
of this set of basic building blocks.
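One way in which a LUT and carry logic can cooperate to perform addition is sketched below. The text does not specify the carry logic in detail, so the arrangement assumed here (the LUT computes a propagate signal, a dedicated XOR forms the sum bit and a dedicated multiplexor forms the carry out) and all function names are assumptions made for illustration.

```python
def to_bits(value, width):
    """Little-endian list of bits for a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Integer value of a little-endian list of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit lists, one LUT plus carry stage per bit."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        propagate = a ^ b                    # computed in the LUT
        sum_bits.append(propagate ^ carry)   # dedicated XOR gate
        carry = carry if propagate else a    # dedicated carry multiplexor
    return sum_bits, carry


sum_bits, carry_out = ripple_add(to_bits(11, 4), to_bits(6, 4))
total = from_bits(sum_bits) + (carry_out << 4)
```

Only the propagate signal occupies the LUT in this arrangement; the carry travels from bit to bit through the dedicated carry logic rather than through general-purpose routing.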
In the Virtex II Pro FPGA two sets of these building blocks are grouped together
vertically and connected by a carry signal, into what is called a SLICE, as shown in
Figure 2.3(b). SLICEs are in turn organised in pairs connected by a carry signal, with
two pairs of SLICEs grouped together to form what is called a CLB (Configurable Logic
Block). Figure 2.4(a) shows the structure of a CLB. Note that the two pairs of SLICEs
in a CLB are not connected together by a carry signal, but the two SLICEs within each
pair are. Each CLB also contains a switch box that allows the SLICEs within the CLB
to be connected both to each other and to other CLBs via the configurable interconnect
in an FPGA.
Note that the wires that are used to form the carry signals between full adders are
not part of the configurable routing network, and that carry signals that pass vertically
from CLB to CLB exist to allow the creation of faster carry chains than would be possible
if the routing network were used.
Figure 2.4: (a) The contents of a Virtex II Pro CLB: four SLICEs and a switch box.
(b) A typical island-style arrangement of logic and routing in an FPGA: a grid of CLBs
with multiplier (Mult) and memory (RAM) blocks, surrounded by I/O blocks.

Xilinx FPGAs, like those of other FPGA manufacturers, have CLBs arranged in
an ‘island’ style within a ‘sea’ of configurable routing wires. CLBs are placed in a grid
formation, as shown in Figure 2.4(b), with many routing wires of different lengths lying
in channels between each row and column of CLBs. Note that, as shown in Figure 2.4(b),
FPGA logic does not consist entirely of CLBs, but includes some blocks for off-chip com-
munication (I/Os), memory (RAM), and multiplication (Mult). I/O blocks normally
surround the outside of the FPGA grid, whilst memory and multiplication blocks tend to
be placed in columns within the fabric.
The interconnect of FPGAs is constructed of buffered metal wires of different
lengths that span the vertical or horizontal gap between two CLBs that are either ad-
jacent, separated by a CLB, or separated by four CLBs. Multiple interconnect wires can
be connected together in order to construct a path between any pair of CLBs in the FPGA.
Connections between interconnect wires or between the interconnect and CLBs are made
through switch boxes that are formed of sets of multiplexors that select the appropriate
connection between wires according to configuration registers.
As FPGAs contain a large amount of configurable logic that may be connected
together in many different ways using the versatile routing fabric, using an FPGA’s re-
sources efficiently whilst achieving high performance is a difficult task. The following
section describes the functionality of the set of CAD tools that are typically used to take
a description of an algorithm and implement it in an FPGA.

Figure 2.5: An illustration of the design flow for FPGAs: design entry, technology
mapping, placement, routing, working hardware.

2.2.2 FPGA algorithm implementation
This section gives a general introduction to the process of implementing an algorithm on
an FPGA. The steps required to take an algorithm description and map it onto an FPGA
implementation are summarised in Figure 2.5.
Whilst each of these steps is itself the subject of intense research, it is only
necessary for the purposes of this thesis to summarise their purpose and the methods used
to perform each step. The survey paper [CCP06] provides an excellent overview of existing
research into each of these steps and is used to provide the summary given here.
Design entry. The first step required is for the hardware designer to enter a description
of the algorithm to be implemented. Typical design specification languages are Register
Transfer Level (RTL) languages such as VHDL and Verilog that specify operations per-
formed at each clock cycle, general purpose languages such as C or SystemC, or domain
specific languages such as Matlab or Simulink. RTL descriptions of an algorithm are
translated into a netlist of connected logic gates by a logic synthesis tool before tech-
nology mapping proceeds. Higher-level descriptions must first be mapped to an RTL
description by a behavioural synthesis tool. As well as the description of the algorithm
the user can specify design constraints such as the required clock speed, input and output
pin locations, etc.
For the design of implementations of DSP algorithms there exist several tools that
allow these to be described in the form of a block diagram containing arithmetic
components connected by communicating wires, as in the case of the System Generator tool
[Xil08b] made by Xilinx, which is used for the purposes of design entry in this thesis.
Further detail on this and the other tools used during the course of the work conducted in
this thesis is given in Chapter 3.
Technology mapping. This step maps logic elements such as registers and gates de-
scribed in a netlist onto the functional units available on the target device, such as the
basic LUTs, carry chains and registers available, as well as the more complex units like
multipliers and memory blocks.
Placement. Placement determines the location of each mapped functional unit on the
target device. In order to maximise the clock frequency of the design it is essential that
the combinational logic paths that are expected to have the longest delay are kept as
short as possible: their functional units must be clustered together on the FPGA. Hence
placement is an optimisation problem, good solutions for which are typically found using
stochastic optimisation techniques like simulated annealing in commercial tools such as
the Xilinx ISE software [Xil08a].
Routing. Routing connects the placed functional units on the FPGA together using the
configurable routing network. Again this step is critical to the delay of paths in a circuit
and is normally the most computationally expensive step in the FPGA tool flow. Routing
is generally performed in two steps: i) global routing, where the switch boxes used by
each net in the circuit are decided, and ii) detailed routing, where the metal wires and
configurable switches used to route a net are selected. In order to both achieve timing
constraints and route a congested circuit through limited routing resources the routing
step must evaluate many different routing options, resulting in the high complexity of
routing FPGA designs.
Summary
The complexity involved in the steps above has implications for the complexity of power
consumption estimation for FPGAs at various stages in the design flow: for example,
estimates made from a design description are likely to be less accurate due to the lack of
specific detail available on the functional units and configurable routing wires used, but
obtaining this information requires computationally expensive steps in the FPGA design
flow to be performed. The implications of this and the accuracy achievable when making
power consumption estimates at various stages of the design flow are explored through the
summary of existing work in this field, presented in the following section.
2.3 Estimation of power consumption of digital circuits
As has been discussed in the introduction to this thesis, the increasingly high power con-
sumption of modern computing devices is a concern for designers as it increases the costs
of powering, packaging and cooling devices, and prevents or limits their use in battery-
powered applications [RJD98]. The reductions in power consumption offered by word-
length optimisation over standard algorithm implementation techniques are significant
and will become another tool with which hardware designers can combat power consump-
tion, provided that algorithms can be optimised in useful amounts of time for the fast
moving electronics industry.
In order for this to be possible it is essential that models be developed that can
estimate the power consumption of a system very quickly, due to the large search space
of word-length optimisation problems as described in Section 2.1. This section aims to
introduce existing techniques for estimating the power consumed in an algorithm imple-
mentation and to highlight their suitability for use within word-length optimisation for
power consumption minimisation. The following section summarises the processes through
which power is dissipated in digital CMOS circuits such as FPGAs; Section 2.3.2
then introduces related research on the modelling of FPGA power consumption.
Figure 2.6: An inverter implemented in CMOS using a p-type transistor (top) and n-
type transistor (bottom). The p-type transistor is connected to the power supply wire
whose voltage is Vdd and the n-type transistor is connected to ground. The inverter
input and output voltages are Vin and Vout, respectively.
2.3.1 Power dissipation in digital CMOS circuits
For almost 30 years CMOS has been the technology of choice for implementing digital
circuits due to the faster, lower-power logic that can be achieved compared
to PMOS and NMOS technologies [TW01]. Logic gates implemented in PMOS and NMOS
dissipate significant amounts of power when their outputs are high and low, respectively,
due to direct connections between the power supply wire and ground wire in the circuit.
CMOS logic gates on the other hand do not suffer from this problem as they use
both n-type and p-type transistors, as shown for an inverter in Figure 2.6. When the
input voltage to this inverter is logic zero, i.e. Vin = 0, the n-type transistor is in cut-off
mode and conducts (virtually) no current. Similarly when Vin = Vdd, i.e. logic one, the
p-type transistor is in cut-off mode and conducts (virtually) no current. As a result when
a stable input is applied to a CMOS logic gate there are no direct connections between
Vdd and ground, unlike in PMOS and NMOS circuits.
Power is still dissipated in CMOS circuits however due to: i) small leakage currents
through transistors in cut-off mode, ii) direct connection of Vdd and ground that can occur
when the output of a logic gate changes from logic one to logic zero (or vice versa), and
iii) charging and discharging of parasitic capacitances when the outputs of logic gates
transition between logic zero and logic one [RP00].
The main sources of leakage current are sub-threshold leakage, which occurs due to
the finite resistance of a transistor when it is in its sub-threshold region, and gate leakage
which is a tunnelling current through the gate oxide insulation of transistors [KAB+03].
Leakage currents are in constant effect when a CMOS device is turned on and do not
change as a result of the switching activity within the device, i.e. they occur even when
the device is not performing computation and hence static power is the name given to the
power dissipated due to these currents.
The remaining power dissipated in CMOS circuits is thus called dynamic power and
is formed from short-circuit dissipation (ii above) and the power consumed due to switching
parasitic capacitance (iii above). Work such as [Poo02] and [CNC96] indicates that short-
circuit power dissipation forms a small percentage of overall dynamic power consumption
(around 10%) and hence power dissipation due to switching of parasitic capacitance forms
the majority of dynamic power consumption. Work in [Poo02] also indicates that short-
circuit power consumption can normally be modelled within the power consumed due to
the switching of parasitic capacitance as the two occur at the same time: when logic gates
change value.
Parasitic capacitance in a logic gate built in CMOS circuits arises from a number
of sources: the capacitances of the gate inputs of transistors driven by the logic gate,
overlap capacitance due to unwanted diffusion of the drain and source impurities into the
channel region, diffusion capacitance in the driving transistors, and finally, interconnect
capacitance between the metal wires that carry signals between logic gates and between
those signals and ground [RP00]. In [RP00] a method for determining the overall parasitic
load capacitance at the output of a logic gate due to the above sources is presented.
However it is not normally possible to measure each of the above sources of parasitic
capacitance individually, and instead it is more convenient to measure or estimate the
overall parasitic capacitance of a logic gate directly from: i) information provided by
silicon manufacturers, and, ii) device-level measurements.
Figure 2.7: The model of the parasitic capacitance of a CMOS logic gate described in
[RP00].

Figure 2.7 depicts the parasitic capacitance model from [RP00] with all the sources
of parasitic capacitance mentioned above modelled as the single load capacitor $C_L$. When
the output of the logic gate shown in Figure 2.7 is raised from logic zero to one, causing
the voltage $V_{out}$ at the output of the gate to rise from 0 to $V_{dd}$, the energy $E$ drawn from
the power supply is given by:
$$E = C_L V_{dd}^2 \qquad (2.24)$$
Half of this energy, i.e. $C_L V_{dd}^2 / 2$, is stored in the parasitic capacitance $C_L$, whilst the other
half is dissipated as heat in the PMOS transistor network. When the output of the logic
gate is switched from logic one back to logic zero, the charge stored on the capacitor is
dissipated as heat in the NMOS transistor network [RP00]. Hence if the output of the logic
gate switches at a frequency of $f$ the resulting power dissipated is:
$$P = f C_L V_{dd}^2 \qquad (2.25)$$
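As a worked illustration of (2.24) and (2.25), the sketch below evaluates both expressions. The load capacitance, supply voltage and switching frequency are hypothetical values chosen only to make the arithmetic concrete; they are not measurements of any device.

```python
def switching_energy(c_load, v_dd):
    """Energy in joules drawn from the supply per 0-to-1 transition (2.24)."""
    return c_load * v_dd ** 2


def dynamic_power(frequency, c_load, v_dd):
    """Dynamic power in watts of a node switching at `frequency` hertz (2.25)."""
    return frequency * c_load * v_dd ** 2


# Hypothetical values: 10 fF load, 1.5 V supply, 100 MHz switching frequency.
energy = switching_energy(10e-15, 1.5)     # about 2.25e-14 J
power = dynamic_power(100e6, 10e-15, 1.5)  # about 2.25e-6 W
```

Halving the supply voltage in this model reduces the power by a factor of four, whereas halving either the capacitance or the switching frequency reduces it only by a factor of two.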
It can be seen from the form of (2.25) above that dynamic power is affected by the
switching activity and hence the functionality of a CMOS circuit, and also by the capaci-
tance and hence construction of that circuit. On the other hand static power consumption
is only affected by the number and geometry of the transistors used to implement a CMOS
circuit. As described in Section 2.2, reconfigurable devices contain a fixed number of tran-
sistors and routing wires that allow the implementation of logic gates that are connected
together in an arbitrary fashion. Hence it can be deduced that the static power consump-
tion of reconfigurable devices is fixed, but their dynamic power consumption may change
depending on the algorithm implemented on the device. As such it should be clear that
word-length optimisation for power consumption minimisation has a direct effect on the
dynamic power consumption of a reconfigurable device. As a result the remainder of the
existing work and novel contributions described in this thesis aim only to use word-length
optimisation to reduce dynamic power consumption.
The analysis of dynamic CMOS power consumption in this section and the re-
sulting dynamic power consumption estimation equation (2.25) form the basis of existing
techniques for estimating the dynamic power consumed in digital CMOS circuits and
reconfigurable devices, as is detailed in the following section.
2.3.2 Modelling of device power consumption
Whereas Section 2.3.1 described the mechanisms through which power is consumed in
CMOS circuits, this section aims to describe existing work that has used this knowledge
of the physics of power dissipation in order to estimate the power consumed in a recon-
figurable device. The complexity of making such an estimate ought to be immediately
apparent from both the nature of reconfigurable devices as described in Section 2.2,
and the way in which they dissipate power. As a result of this complexity power consump-
tion estimation methods typically abstract away from some of the details of a device in
order to estimate power consumption in feasible amounts of time. The greater the abstraction,
the greater the potential for simplification of the power consumption estimation
problem; however, greater simplification typically results in a lower level of accuracy for
power consumption estimates. This section will summarise existing techniques in this area
in increasing order of abstraction.
Switch-level simulation
Section 2.3.1 showed that the dynamic power dissipated by the switching of the output of
a CMOS logic gate is determined by the supply voltage Vdd, parasitic capacitance C driven
by the gate, and the frequency f at which the output is switched, as shown in (2.25). In a
circuit clocked at the rate fclk, the average power consumption of a particular logic
gate whose output is switched at an average rate of α times per cycle is given by:
\[ P_{av} = \frac{\alpha}{2} \cdot f_{clk} \cdot C \cdot V_{dd}^2 \tag{2.26} \]
Hence it is possible to estimate the dynamic power consumption of an entire clocked circuit
if both the average number of transitions per cycle and the parasitic capacitance of the output
of every logic gate are known.
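As a minimal sketch of how (2.26) scales up to a whole circuit, the following Python fragment sums the per-gate estimate over a list of gate outputs. The function names, the list layout and the numerical values are assumptions made purely for illustration; real activity and capacitance values would come from simulation and from the placed and routed design.

```python
def gate_power(alpha, c_farads, f_clk_hz, v_dd):
    """Average dynamic power of one gate output, following (2.26)."""
    return 0.5 * alpha * f_clk_hz * c_farads * v_dd ** 2


def circuit_power(gates, f_clk_hz, v_dd):
    """Sum (2.26) over (alpha, capacitance) pairs, one pair per gate output."""
    return sum(gate_power(a, c, f_clk_hz, v_dd) for a, c in gates)


# Hypothetical activities (transitions per cycle) and capacitances (farads).
example_gates = [(0.5, 10e-15), (0.25, 20e-15), (2.0, 5e-15)]
total_watts = circuit_power(example_gates, f_clk_hz=100e6, v_dd=1.2)
```

Note that an activity of α = 2 (one rising and one falling transition per cycle) reduces (2.26) to (2.25) with f = fclk.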
The capacitance values of integrated circuit components are known to FPGA manu-
facturers through measurements made from devices and from device level simulation tools
such as SPICE, as explained in detail in [DT05]. Once an algorithm has been mapped,
placed and routed onto an FPGA it is possible to use the known capacitance values of
each circuit element to calculate the total capacitance of the configurable routing wires
used by the algorithm, and of the logic gates used by the algorithm.
Obtaining average activity values for each logic gate in a circuit is a more complex
problem, however. Switch-level simulation determines signal activities by performing a
simulation of a circuit. This technique propagates logic transitions at the inputs of a logic
gate to its outputs, depending on the current values of the stable inputs to the logic gate,
and the operation implemented by the gate. Additionally it is possible to account for
the inertial delay of logic gates, which is the phenomenon where several input transitions
occurring in a very short space of time are filtered out due to the rise and fall times of
the outputs of the transistors within logic gates. Switch-level simulation can thus achieve
accurate estimates of the activity within a circuit, and is used to estimate the activity of
placed and routed circuits in the tool offered by Xilinx called XPower [Xil07]. Work in
[LLH+05] describes in greater detail the construction of a switch-level simulation tool and
the calculation of capacitances required for power estimation.
This method of power consumption estimation has a significant drawback however:
its computational complexity. To obtain capacitance values an FPGA design must be
placed and routed, and as explained in Section 2.2.2 the steps involved in reaching this
state have a high computational cost. Worse than this however is the problem of obtaining
activity values through switch-level simulation, as each transition that occurs during sim-
ulation requires a number of microprocessor instructions to process and log. Transitions
in a combinational logic circuit can accumulate because of the different arrival times of
signals at the inputs to each logic gate, causing multiple transitions (glitches) before the
gate settles to its final value for that clock cycle. The larger the combinational circuits in
an algorithm, the more glitches occur and hence the slower switch-level simulation becomes.
In addition, transitions in a logic signal from one clock cycle to the next occur
due to the transmission of data that is changing with time, hence the transition activity
of elements within a digital circuit is data dependent. This dependency means that the
use of simulation for activity estimation in digital circuits faces similar problems to range
analysis and noise analysis in word-length optimisation (see Section 2.1.2): to ensure a
circuit is exercised properly during simulation, test vectors should ideally be extremely
long; however, the computational complexity of simulation dictates that shorter test vectors
will be favoured at the cost of accuracy.
Although switch-level simulation for activity estimates and full place and route for
capacitance values are very useful for obtaining accurate estimates of the power consumed
by circuits (or parts of circuits) driven by specific test vectors, they are clearly unsuitable
as a method for estimating power consumption during word-length optimisation, due to
their computational complexity and the reasons cited in Section 2.1. Due to the need
for faster power consumption estimates for other CAD tools, there exists other research
work in this area where higher levels of abstraction are leveraged to lower computational
complexity, as detailed in the following subsections.
Transition density
As stated in the preceding section, switch-level simulation of a circuit is computationally
expensive because each logic transition in each signal that occurs during simulation must
be processed. The transition density method, originally proposed in
[Naj93], is a probabilistic technique for estimating the average activity in the logic gates
of a circuit which removes the need for simulation and the dependency on
appropriate simulation vectors.
The method estimates the expected number of transitions at the output of a logic
block during a clock cycle and is defined for a multiple-input logic function f as follows.
Given a logic function f(x1, x2, ..., xn) of the boolean inputs x1, x2, ..., xn, the output of
f will depend on input xi when the boolean difference of f with respect to xi, i.e. $\partial f / \partial x_i$, is
equal to 1. In other words, transitions in input xi will be propagated to the output of f
when $\partial f / \partial x_i = 1$. The boolean difference of f with respect to xi is defined as:
\[ \frac{\partial f}{\partial x_i} = f|_{x_i=0} \oplus f|_{x_i=1} \tag{2.27} \]
Thus we can estimate the number of transitions at the output of the logic block f
due to activity on the input xi by calculating the probability that the boolean difference
$\partial f / \partial x_i$ is satisfied, and multiplying this by the expected number of transitions on input xi,
denoted D(xi) and called the transition density of xi. The total activity D(f) at the
output of logic block f is thus the sum of the activities contributed by each input of f
[Naj93]:
\[ D(f) = \sum_{i \in I} P\left(\frac{\partial f}{\partial x_i}\right) D(x_i) \tag{2.28} \]
Applying the transition density model to a circuit involves propagating both signal proba-
bilities (to allow boolean differences to be calculated) and transition density estimates
through the logic blocks of the circuit. Transition density estimates are propagated
through the method described above, whilst signal probabilities are estimates of the av-
erage value that a signal takes over time and can be automatically propagated through a
circuit using a variety of methods as summarised in [EFD+89].
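The propagation step in (2.27) and (2.28) can be illustrated for a single small logic block with the Python sketch below. It is a hedged illustration and not the implementation from [Naj93]: it assumes the inputs are statistically independent, it evaluates the boolean difference by exhaustive enumeration (so it is only practical for functions with few inputs, such as a LUT), and all function names and example values are hypothetical.

```python
from itertools import product


def boolean_difference_probability(f, n_inputs, i, p_one):
    """P(df/dx_i = 1), assuming statistically independent inputs.

    f      -- function taking a tuple of n_inputs bits (0 or 1), returning 0 or 1
    p_one  -- p_one[j] is the probability that input j is logic one
    """
    prob = 0.0
    others = [j for j in range(n_inputs) if j != i]
    for values in product((0, 1), repeat=len(others)):
        bits = [0] * n_inputs
        weight = 1.0
        for j, v in zip(others, values):
            bits[j] = v
            weight *= p_one[j] if v else 1.0 - p_one[j]
        bits[i] = 0
        f0 = f(tuple(bits))
        bits[i] = 1
        f1 = f(tuple(bits))
        if f0 ^ f1:  # the boolean difference (2.27) equals 1 for this assignment
            prob += weight
    return prob


def transition_density(f, p_one, d_in):
    """Output transition density D(f), following (2.28)."""
    n = len(p_one)
    return sum(
        boolean_difference_probability(f, n, i, p_one) * d_in[i]
        for i in range(n)
    )


def and2(x):
    return x[0] & x[1]


def xor2(x):
    return x[0] ^ x[1]
```

For a two-input AND gate, a transition on one input reaches the output only while the other input is one, whereas for an XOR gate every input transition propagates, so the XOR output density is simply the sum of the input densities.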
Whilst the transition density method is faster than switch-level simulation of a
circuit, the propagation of transition densities and signal probabilities remains computa-
tionally expensive for large circuits that consist of large portions of combinatorial logic.
Additionally, exactly calculating signal probability in a circuit has been shown to be
NP-hard [KT89], hence approximations must be calculated for large circuits to reduce
complexity.
Unlike switch-level simulation, the transition density method cannot account for the inertial
delay of logic gates as a zero-delay model is used, hence the method may tend to over-estimate
activity in a circuit. Also, the activity values in a circuit are only part of the
information required to estimate power consumption: capacitance values for routing wires
are also necessary. Configurable logic capacitance values can be assumed to be identical
throughout a device and so can be measured once only; however, the capacitance of the
configurable routing wires is affected by the place and route process and hence by the
algorithm implemented on an FPGA. In [PYW02] the transition density method is used to estimate
the activity in circuits implemented in FPGAs; however, routing capacitance values
are obtained from fully placed and routed circuits that would be too computationally
expensive to obtain during word-length optimisation. Hence for the reasons above, the
direct application of the transition density method to the problem of estimating power
consumption during word-length optimisation is undesirable.
Pre-routing capacitance estimation
Estimation of the capacitance of the wires in the configurable interconnect of an FPGA
before routing has been performed has been considered in existing work, with the aim of
allowing such estimates to be used in power-aware placement or routing of FPGA designs.
Work in [FWAW05] uses information after a netlist has been technology mapped (see
Section 2.2.2), whilst [AN04] uses information available after a design has been placed
onto an FPGA, in order to predict individual net capacitance. Work in [BB07] also relies
on information available after placement and uses a probabilistic analysis to determine the
likelihood of each resource in the FPGA routing fabric being used in order to obtain total
routing power estimates. The methods used in these pieces of work will now be described
in more detail below.
In [FWAW05] the authors propose using the fan-out of each net in a technology
mapped netlist to predict the net’s capacitance. The fan-out of a net is the number of
logic block inputs that it drives. Whilst this information is available before the placement of an FPGA
design (which as explained in Section 2.2.2 is computationally expensive), fan-out alone
is not a very good predictor of capacitance, as demonstrated by results presented in this
thesis in Section 5.1.1, and by results in [AN04]. Although high fan-out may be an indicator
of high capacitance, fan-out alone cannot account for the high capacitance of nets that
are stretched between logic blocks that are far apart.
In [AN04] post-placement information is used that significantly enhances the ca-
pacitance estimates made. After placement has been performed for an FPGA design the
locations of all logic blocks are known, and hence a measure of the distance between the
blocks that a net connects to can be used to estimate capacitance. The measure used is
the half-perimeter of an imaginary box which bounds the net, which is commonly used
as a substitute for minimum wire-length at the placement stage as it is much more easily
calculated. The half-perimeter bounding box is the true minimum wire-length for two and
three-terminal nets, whilst for nets with more terminals it serves as an approximation of
minimum wire-length.
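The half-perimeter measure described above is straightforward to compute once block locations are known. The following Python sketch assumes each terminal of a net is given as a hypothetical (x, y) grid coordinate of the logic block it connects to; the function name and the example net are illustrative only.

```python
def half_perimeter(terminals):
    """Half-perimeter of the bounding box enclosing a net's terminals.

    terminals -- list of (x, y) logic block coordinates for one net
    """
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Hypothetical three-terminal net on a grid of logic blocks.
net = [(1, 1), (4, 2), (2, 5)]
hpwl = half_perimeter(net)  # (4 - 1) + (5 - 1) = 7
```

Only the extreme coordinates matter, which is why the measure is so much cheaper to evaluate than a routed wire-length.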
Additionally the work in [AN04] uses the following details that are available after
placement of designs onto Xilinx FPGAs: i) the number of connections that a net has to
each of the two different types of LUT on the FPGA (F and G LUTs), and ii) the number
of CLB tiles that a net has at least one connection in (see Section 2.2 for details on FPGA
construction). The authors examine various combinations of the above parameters and
achieve a high level of accuracy: the mean relative error in capacitance across the set of
benchmark circuits used is 36%, compared to 84% when using fan-out alone.
Work in [BB07] takes a significantly different approach to both [FWAW05] and
[AN04]. Rather than estimating the capacitance of individual nets, the demand on the
resources (i.e. switch boxes, wire segments) of the routing channels in a placed design is
calculated according to work in [KB06]. These demands determine the probability that
particular resources are used by particular nets. The expected power contributed to a
particular routing channel R by a net j can thus be estimated by multiplying the demand
probability Dkj that net j uses resource k by the activity αj of net j and by the capacitance
Ck of resource k, and summing these terms for all resources K in the channel:
\[ P_{Rj} = \sum_{k \in K} \frac{1}{2} D_{kj}\, \alpha_j\, C_k\, f_{clk}\, V_{dd}^2 \tag{2.29} \]
where fclk and Vdd are the circuit’s clock frequency and power supply voltage, respectively.
The demand probability Dkj that net j uses resource k in routing channel R is calculated
according to the distance between net j and the resource and also according to the demand
exerted on the resource by other components, as described in [KB06].
Whilst the capacitance estimation results presented in [BB07] indicate the method
is accurate, it is difficult to compare the work to [AN04] as only the error in the total capacitance
estimated for a benchmark system is quoted in [BB07], rather than individual net
capacitance error. Additionally [BB07] uses the Versatile Place and Route (VPR) framework
to gather true capacitance and power values for comparison to the estimates made, but
VPR is an FPGA architecture exploration tool and the architectures used are not necessarily
representative of the Xilinx FPGA used in [AN04].
Nevertheless these pieces of work exhibit promising results for pre-routing capaci-
tance estimation that will be useful in power-aware placement and routing. Unfortunately
they are not suitable for use within power consumption estimation for word-length optimi-
sation due to the need for technology mapping (in [FWAW05]) and also for the placement
of logic blocks to have been calculated (in the case of [AN04] and [BB07]), which are too
computationally expensive for use within word-length optimisation.
Activity estimation from word-level information
An alternative method that has been developed in existing work for abstracting away
from the detail of switch-level simulation is to use word-level information on a signal
to estimate the bit-level activities of the two’s complement representation of that signal.
This word-level information can be measured during high-level simulation of an algorithm.
Whereas switch-level simulation must process each transition that occurs in a system
during simulation, word-level simulation need only execute each operation that occurs
during simulation and log the result, which can be orders of magnitude faster than switch-
level simulation as most if not all operations in the algorithm will map to single instructions
on the microprocessor on which the simulation is performed.
The Dual Bit Type (DBT) method presented in [LR96] represents the first impor-
tant piece of work in this area. In this work the authors demonstrate that word-level
parameters of Gaussian signals can be used to estimate the bit-level activities of two’s
complement signals. The authors identify first of all that a number of bits starting from
the Least Significant Bit (LSB) side of a two’s complement representation of a Gaussian
signal X[n] exhibit no correlation with each other or with themselves over time (i.e. from
one signal sample to the next), and that for each bit xi in this region:
\[ P(x_i = 1) = P(x_i = 0) = 0.5 \tag{2.30} \]
i.e. there is an equal probability of each bit in this region being a one or a zero. As a result
the probability that a logic transition occurs in one of the bits xi in this region when the
current sample changes from X[n] to X[n+ 1] is given by:
\[ P(x_i \neq x'_i) = P(x_i \cdot \overline{x'_i}) + P(\overline{x_i} \cdot x'_i) = 0.5 \tag{2.31} \]
where $x'_i$ is the value of bit i in sample X[n + 1].
The authors also noted that as the range that can be represented by the two’s
complement representation of a signal is larger than the range of the signal itself (to avoid
overflow), then at the Most Significant Bit (MSB) end of the signal there will be the sign
bit followed by several bits whose value is always identical to the sign bit (they can be
seen as sign extension bits).
The authors show that if the represented signal is Gaussian, the ‘break-point’ BP0
below which all LSB-side bits lie within the region of uncorrelated bits with transition
activity level 0.5 (which is denoted region 1) can be calculated using:
\[ BP_0 = \log_2 \sigma + \log_2\left( \sqrt{1 - \rho^2} + \frac{|\rho|}{8} \right) \tag{2.32} \]
where σ and ρ are the standard deviation and lag-1 autocorrelation coefficient of the
Gaussian signal, respectively. The ‘break-point’ BP1 above which all MSB-side bits are
identical to the sign bit (denoted region 3) can be calculated using:
\[ BP_1 = \log_2\left( |\mu| + 3\sigma \right) \tag{2.33} \]
where µ is the mean of the Gaussian signal. The equations for both BP0 and BP1 in
(2.32) and (2.33) above calculate the bit positions of the above described breakpoints
relative to the binary point of the two’s complement signal.
Whilst the activity of the uncorrelated bits in region 1 is known to be 0.5, the
activity of the sign bit and ‘sign extended’ bits in region 3 can be affected by the word-
level parameters of a Gaussian signal. Work in [BHS98] demonstrated that for a zero mean
Gaussian signal the probability of a transition in the bits in region 3 can be determined by
calculating the probability of the Gaussian signal changing from a positive to a negative
value (or vice versa) between one sample and the next. This can be calculated according
to the joint probability distribution function (PDF) of the current and previous samples
in the signal, which is assumed to be a bivariate Gaussian distribution when the signal
represented is Gaussian. Hence the probability of a sign change in a Gaussian signal with
lag-1 autocorrelation parameter ρ can be determined by integrating over the appropriate
area of the bivariate Gaussian distribution as shown in [BHS98] and given by (2.34), and
this probability also gives the transition probability Tmsb of the sign bit(s) in region 3 of
a zero-mean Gaussian signal.
\[ P(\text{sign change}) = T_{msb} = \frac{1}{\pi} \cos^{-1}(\rho) \tag{2.34} \]
Whilst the activities in regions 1 and 3 are clearly defined in the DBT model, there
remain several bits in a region between these two (called region 2) which show increasing
spatial correlation (i.e. correlation between a bit and its adjacent bits) as they get closer
to the MSB end (where the fully correlated bits of region 3 lie). The authors note however
that this region only contains a small number of bits and that as the activities in both
regions 1 and 3 are known it is possible to use linear interpolation between these two
regions to obtain estimates of the activity values of the bits in region 2.
A typical bit-level activity profile of a Gaussian signal is shown in Figure 2.8. As
described in the DBT model, region 1 can be approximated by uncorrelated bits each
having an activity rate of 0.5, region 3 can be approximated by fully spatially-correlated
bits with activity Tmsb, and region 2 can be approximated by linear interpolation between
regions 1 and 3.

[Figure 2.8: The typical profile of a Gaussian signal, with activity regions 1 and 3 from the DBT model highlighted. The breakpoints between activity regions (BP0 and BP1) and the level of activity in region 3, Tmsb, are also shown. Bit 19 is the MSB. Horizontal axis: signal bit (1 to 19); vertical axis: average activity (transitions / cycle), 0 to 0.7.]

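The three-region DBT profile can be sketched in a few lines of Python using (2.32), (2.33) and (2.34). This is an illustrative sketch and not the authors' implementation: it assumes that bit position b carries weight 2**b relative to the binary point, it leaves the breakpoints unrounded, and it applies the zero-mean expression for Tmsb whatever the mean supplied. The function names are hypothetical.

```python
import math


def dbt_breakpoints(mu, sigma, rho):
    """Breakpoints BP0 (2.32) and BP1 (2.33), relative to the binary point."""
    bp0 = math.log2(sigma) + math.log2(math.sqrt(1.0 - rho ** 2) + abs(rho) / 8.0)
    bp1 = math.log2(abs(mu) + 3.0 * sigma)
    return bp0, bp1


def msb_activity(rho):
    """Sign-bit transition probability Tmsb (2.34) for a zero-mean signal."""
    return math.acos(rho) / math.pi


def dbt_activity(bit, mu, sigma, rho):
    """Estimated activity of the bit at position `bit` (assumed weight 2**bit)."""
    bp0, bp1 = dbt_breakpoints(mu, sigma, rho)
    t_msb = msb_activity(rho)
    if bit <= bp0:
        return 0.5      # region 1: uncorrelated bits
    if bit >= bp1:
        return t_msb    # region 3: sign and sign-extension bits
    # Region 2: linear interpolation between the two known activity levels.
    frac = (bit - bp0) / (bp1 - bp0)
    return 0.5 + frac * (t_msb - 0.5)


# Hypothetical signal statistics measured during a word-level simulation.
profile = [dbt_activity(b, mu=0.0, sigma=256.0, rho=0.9) for b in range(16)]
```

Only the mean, standard deviation and lag-1 autocorrelation of the signal are needed, so the whole profile is obtained without logging any individual bit transitions.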
The work in [LR96] and [BHS98] allows an important abstraction away from bit-
level simulation to word-level simulation of systems in order to estimate power consump-
tion. The mean, standard deviation and lag-1 autocorrelation of each signal in a system
can be measured during a high-level simulation of the system, whilst measuring bit-level
activities directly would require either logging each bit change in each two’s complement
signal or performing a switch-level simulation, both considerably more computationally
expensive than a high-level simulation.
Unfortunately however the DBT method has a significant limitation in that it
provides no way of estimating the activity of signals whose values change multiple times
within a clock cycle, i.e. signals that contain glitches due to different signal arrival times
in preceding logic and hence change more than once per clock cycle before settling to their
new value. Any signals that emerge from combinational logic blocks rather than clocked
registers can contain glitches and so these can represent a significant proportion of the
activity in arithmetic intensive circuits. The DBT model is thus an accurate technique
for estimating signal activity for registered signals only, and cannot be applied to internal
signals within combinational logic that contain glitches, for example.
The DBT model has one other limitation in that it is an accurate technique for
estimating the bit-level activity of two’s complement Gaussian signals, but signals with
other PDFs may not exhibit the same bit-level activities as Gaussian signals. As indicated
in [LR96] and [BHS98] however signals in digital signal processing applications are in gen-
eral well-approximated by Gaussian signals and additionally it is known from the Central
Limit Theorem [Chu74] that the PDF of the sum of a number of independent random
variables will tend towards a Gaussian PDF. For digital signal processing applications
then the DBT model can be expected to provide accurate estimates of bit-level activities
from word-level signal statistics.
Macro-modelling
Existing work on power consumption estimation in FPGAs discussed up to this point has
concentrated on abstraction away from either the capacitance or the activity of signals
within a circuit. In this section techniques which leverage abstraction away from both
the capacitance and activity of blocks of combinatorial logic in an FPGA are presented.
These power consumption models have been used to estimate the power consumed in
combinational circuits of a variety of sizes from LUTs in [LLH+05] (discussed earlier
in this section under switch-level power estimation) to logical operators and arithmetic
components in [SJ01; JTB04; RSN06; CW06] and due to their application to these larger
circuits they are termed macro-models.
The macro-modelling techniques in [SJ01; JTB04; RSN06; CW06] are based on
the following principle: that the dynamic power consumption in a logic circuit can be
estimated by using a pre-measured value of the logic circuit’s power consumption made
when similar input activities were used to those input activities currently observed in
the circuit. Because pre-measured power consumption estimates are used macro-models
can be extremely fast: if all components in a circuit can be mapped to macro-models no
switch-level simulation, placement or routing is required to estimate power consumption.
However the problem of knowing the power consumption of a combinational circuit
under any sequence of inputs has similar complexity to other simulation-based problems
discussed in the background to this thesis (see Section 2.1.2): there can be an infinite
number of input vectors to such a circuit and it will thus be impossible to measure the
circuit’s power consumption for all but a small subset of these. Hence macro-models have
a small, finite number of power consumption data points (with each point having different
input activities) from which to estimate the power consumption of a circuit. The first main
differentiator of macro-models then is the method through which the observed activities of
the inputs to a circuit are mapped to one of the activities that were used for pre-measured
power consumption values.
To estimate dynamic power consumption in a circuit it is necessary to know the
capacitance of signals within the circuit as well as their activities. Macro-models must
account for these capacitances and the way in which glitches build up through a circuit
due to its construction. The second main differentiator of macro-models is the method
through which the capacitances and construction of the circuits are accounted for. The way
in which the macro-models in [SJ01; JTB04; RSN06; CW06] approach the two problems
outlined above is discussed in detail below.
The work in [SJ01] constructs several macro-models, each able to estimate the
power consumption in one of several small circuits ranging in complexity from a 32-bit
two’s complement adder to an ALU and both Finite Impulse Response (FIR) and Infinite
Impulse Response (IIR) filters. In order to construct a macro-model the input signals to
a circuit are parameterised as follows. Firstly the input bits to the circuit are partitioned
into groups such that the input bits within each group exhibit a minimum level of spatial
correlation, whilst bits that are in different groups have lower spatial correlation than this
minimum level. An equation as shown in (2.35) is then constructed to model the power
consumption in the circuit using the following parameters: the average signal probability
Pk, average activity Ak and average spatial correlation Sk for each input group k, and
the average transition density D of the circuit’s outputs (which must be estimated using
the transition density technique).
\[ \text{Power} = c_D D + \sum_{k \in K} \left( c_{P_k} P_k + c_{A_k} A_k + c_{S_k} S_k \right) \tag{2.35} \]
In the power consumption estimation equation for [SJ01] above cD, cPk , cAk and cSk are
characterisation coefficients whose values are determined through least squares regression
using a set of power consumption values gathered using switch-level estimation (e.g. the
Xilinx XPower tool [Xil07]). Although only linear terms in Pk, Ak and Sk are shown
above, quadratic and cubic forms as well as cross-terms of these three variables can be
included for every input bit group k if necessary, where higher order terms are included
automatically by a heuristic during characterisation in order to ensure the macro-model
fits to the characterisation data within a given level of error. The resulting models contain
up to 17 terms as a result, but no theoretical explanation is given for the higher order
terms required.
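The characterisation step for a model of the linear form (2.35) can be sketched as an ordinary least squares fit. The Python fragment below is an illustration under stated assumptions, not the procedure used in [SJ01]: it handles linear terms only, solves the normal equations directly (which can be poorly conditioned for real data), and expects the measured power values to be supplied by the caller. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_macro_model(features, powers):
    """Least squares coefficients for Power = sum_j coeff[j] * feature[j].

    features -- one row per characterisation run, e.g. (D, P_1, A_1, S_1)
    powers   -- the power value measured for each run
    Solves the normal equations (X^T X) c = X^T y.
    """
    n = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * y for row, y in zip(features, powers))
           for i in range(n)]
    return solve_linear_system(xtx, xty)


def estimate_power(coeffs, feature_row):
    """Evaluate the fitted model for one set of input parameters."""
    return sum(c * v for c, v in zip(coeffs, feature_row))
```

Once the coefficients are stored, each power estimate costs only one multiply-accumulate per term, which is the source of the speed advantage of macro-models.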
The proposed model suffers significantly when a power estimate is required for a
circuit driven by an input signal significantly different from those under which the circuit
was characterised. In such a situation the authors propose using an ‘adaptive’ approach
where both the power estimated by (2.35) and a switch-level power estimate obtained
from a short simulation of the circuit under the new inputs are combined to make a more
accurate power estimate than could be obtained otherwise.
Although this approach may reduce the error of the power consumption estimates
made in such situations, it removes the main advantage of power macro-models which is
that they can be evaluated extremely quickly without the need for performing switch-level
simulation.
A second shortcoming of this work is that any circuits that have the same compu-
tational function but have inputs of different word-lengths require different macro-models.
Adder circuits of input word-lengths of 8, 9 and 10 bits, for example, each require a different
macro-model. Hence this work is unattractive for use within word-length optimisation
as many macro-models would need to be characterised and stored in memory for each type
of operation in a system that is to be optimised.
Work in both [JTB04] and [RSN06] develops power macro-models for adders and
multipliers in FPGAs and in contrast to the work in [SJ01] they include the word-lengths
of the inputs to a component as one of the parameters in their power consumption macro-
models. This allows the creation of only one macro-model per component type (i.e. adder,
multiplier, etc.) for which power estimation is required. The two pieces of work differ
slightly in the parameters used to model the activity of the input signals to components.
The work in [JTB04] uses the word-length N of the inputs to the component, the
average activity Ai across the bits of each input i and the average spatial correlation Si
across the bits of each input i. In a similar way to [SJ01], power is estimated by using
a sum of a number of combinations (of up to cubic order) of the above terms, with each
combination having a characterisation coefficient determined through least squares
regression. Results are presented that show that the accuracy of the macro-models for
adders and multipliers increases as the polynomial order of the terms (and cross-terms)
in the power consumption equation is increased from linear to cubic. As with [SJ01]
however no theoretical explanation for the cubic order power consumption equations for
these components is offered. The resulting cubic equations require the characterisation
and storage of 20 coefficients.
In [RSN06] the macro-models developed use the word-length N of the inputs to
the component, the Hamming distance Hi (the percentage of bits that change per clock
cycle), the one distance Oi (the percentage of bits that remain at logic one
per clock cycle) and the zero distance Zi (the percentage of bits that remain at logic zero
per clock cycle) of each input i. Unfortunately no details are provided on the form of the
macro-models used, but it can be assumed that methods similar to those in [SJ01]
and [JTB04] are used, i.e. the use of least squares regression to find the coefficients for functions
that have linear and possibly higher-order terms in the parameters listed above.
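The three input parameters defined above can be measured directly from a word-level trace of an input. The sketch below is an assumption about how such a measurement might be made, since [RSN06] is not quoted in that level of detail here: it averages over all consecutive sample pairs and returns fractions rather than percentages, and the function name is hypothetical.

```python
def distance_parameters(samples, n_bits):
    """Average Hamming, one and zero distances of a sequence of words.

    samples -- successive values of one input (at least two), given as
               non-negative integers holding the n_bits-wide bit pattern
    Returns (H, O, Z) as fractions of the word; the three sum to one.
    """
    mask = (1 << n_bits) - 1
    changed = 0
    stayed_one = 0
    for prev, curr in zip(samples, samples[1:]):
        changed += bin((prev ^ curr) & mask).count("1")
        stayed_one += bin(prev & curr & mask).count("1")
    total = n_bits * (len(samples) - 1)
    h = changed / total
    o = stayed_one / total
    return h, o, 1.0 - h - o


# Hypothetical 4-bit input changing from 1100 to 1010 in one clock cycle.
h, o, z = distance_parameters([0b1100, 0b1010], n_bits=4)
```

In this example two of the four bits change, one remains at logic one and one remains at logic zero.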
Finally, the work in [CW06] offers macro-models for multipliers implemented in the
embedded multiplier blocks of FPGAs, in contrast to the macro-models in [SJ01; JTB04;
RSN06] that are intended for multipliers implemented in LUTs. The authors propose
using the average activity over the two multiplier inputs A as the only parameter for the
macro-model, though they note that the functional form of the macro-model itself may be
non-linear, i.e.
Power = f(A) (2.36)
where f is a potentially non-linear function. The form of f is not stated in the paper but
it can be seen that the power consumption estimates made by the proposed model are
non-linear in terms of A. Once again the complexity of the model used is not justified.
Macro-modelling summary
Macro-models for FPGA power consumption presented
in previous work have concentrated on the estimation of the power consumed in adders and
multipliers and have shown high levels of accuracy for the characterisation data used. The
macro-models described above all achieve Root Mean Square errors of less than 15% when
their power consumption estimates are compared to those from switch-level simulation.
These low levels of error achieved are promising considering the speed-up achieved by
the models which are formed from closed form equations and are hence many orders of
magnitude faster than simulation-based techniques with capacitance data obtained from
routed circuits.
Unfortunately, rather than using knowledge of the circuit’s construction
and input word-length information to construct the models, the authors resort to regression
to characterise high-order (e.g. cubic) polynomials in terms of the activity and word-length
parameters used to estimate power consumption, without justification for the order of
polynomials used. This leads to unnecessarily complex equations that require up to 17 and
20 coefficients to be characterised in [SJ01] and [JTB04] respectively (n.b. the forms of
the models used in [RSN06] and [CW06] are unknown), and every one of these coefficients
must be evaluated for each power consumption estimate.
Another concern with these models is the parameters chosen to model the activities
of the input signals to components. All the macro-modelling techniques described above
use averages of the activities of the input bits across a word except [RSN06], which uses
the percentage of input bits that change value per cycle, which is inherently linked to
the average activity of the input bits. Work in [JTB04] has shown however that input
signals with input imbalance (i.e. groups of input bits that exhibit similar behaviour within
each group but different behaviour between groups) can cause significantly different levels
of power consumption in components compared to the power consumption estimated by
macro-models that use the average of the input signal activities parameter.
When macro-models using average activity parameters were used to estimate the
power consumed in components with input signal imbalance in [JTB04], 32% of the esti-
mates made were outside of a ±25% error range. This is because different input bits in
these components affect different sized portions of the component itself and hence some
have greater effects on power consumption than others. The DBT model, described earlier
within this section, showed that Gaussian signals (whose PDF is typical of signals seen
in DSP systems) have three different regions of bit-level behaviour and so these clearly
exhibit input imbalance, hence suggesting that the above macro-modelling techniques are
less accurate for application to power estimation of DSP circuits.
Additionally none of the macro-models described are able to account for the power
consumed in their target components when the input signals driving those components
contain glitches. Instead all input signals are assumed to be glitch-free signals, though in
practice it is not always possible to pipeline an algorithm to the extent that all components
are registered, due to either system input to output latency constraints or feedback loops
within the system.
Finally, the macro-modelling techniques discussed within this section allow for the
dynamic power consumed within adders and multipliers implemented on FPGAs to be es-
timated, but they do not allow the power consumed in the routing wires that connect these
components to be estimated. Results presented in [SKB02] indicate that routing
power in FPGA circuits is high, at 60% of total dynamic power consumption, and as a
result it should be incorporated into power estimates for word-length optimisation so that
the word-lengths of these connecting signals can be targeted. As shown in Section 2.2.2,
however, placement and routing are too computationally expensive for use within
word-length optimisation, and hence no fast and accurate technique for estimating the power
consumption in these routing wires is available in existing work.
2.3.3 Summary of power consumption estimation
Fast and accurate estimation of dynamic power consumption has been approached in
several pieces of work which have taken advantage of interesting abstractions away from
the low-level details of a device in order to reduce the complexity of making a power
consumption estimate. In general however the more accurate techniques available require
several steps of the FPGA design flow to have been executed in order for more low level
information on a design to be available when making a power consumption estimate,
whilst faster high-level estimates are most accurate for specific input signals that may
not encompass the full needs of particular application domains. Table 2.4 summarises the
methods for power consumption estimation in FPGAs that were described in this section.
2.4 Summary of existing work
This chapter has summarised existing work in the fields of word-length optimisation and
power consumption estimation, as well as summarising the complexity of implementing
the real-time algorithms targeted by word-length optimisation in this work on high per-
formance reconfigurable hardware.
Considerable research has been conducted into methods for word-length
optimisation; however, the underlying complexity of the problem, due to the size of the
search space available, remains. In the area of power consumption estimation, fast techniques
for accurately estimating the dynamic power in an entire system have not yet been
developed.
This thesis presents a novel method for performing word-length selection that of-
fers advantages over existing work in this field by calculating a tight lower-bound on
the cost achievable during word-length optimisation. Additionally the work presented in
the following chapters forms the first method for estimating the entire dynamic power
consumption in implementations of an FPGA-based algorithm with little computational
expense, allowing multiple word-length optimisation for power consumption minimisation
to be performed for the first time.
Table 2.4: A comparison of power estimation techniques.
Technique | Design level | Method | Comments
Switch-level simulation [Xil07; LLH+05] | Post-place and route | Switch-level simulation. | Most accurate and computationally expensive technique.
Transition Density [PYW02] | Post-place and route | Probabilistic estimation of signal activities. | Avoids simulation. Still need placed and routed design. Can over-estimate activities due to zero-delay model.
Capacitance estimation [FWAW05; AN04] | Post-map [FWAW05], post-placement [AN04]. | Net fan-out [FWAW05], net fan-out, bounding box and device specific parameters [AN04] | Avoids routing [AN04] and placement [FWAW05]. Fan-out alone [FWAW05] is an inaccurate estimate of capacitance. Both pre-placement stages and placement still computationally expensive.
DBT model [LR96] | High-level activity model | Estimates bit-level activity from word-level statistics. | Avoids low-level simulation. Statistics need to be measured once only. Assumes Gaussian signal PDFs, typical of DSP systems. Cannot account for glitches in signals.
Macro-model [SJ01] | High-level models of adders, ALU unit, FIR and IIR filters | Uses average activity values of grouped input bits. Up to cubic polynomial terms. | Components with different input word-lengths must be characterised as different blocks. Cubic complexity unjustified. No input glitches. No inter-component routing power.
Macro-model [JTB04; RSN06] | High-level models of adders and multipliers | Input word-lengths, average activity values of inputs [JTB04] or input Hamming distance [RSN06]. Up to cubic polynomial terms [JTB04]. | Cubic complexity unjustified [JTB04]. No input glitches. No inter-component routing power.
Macro-model [CW06] | High-level models of multipliers implemented in LUTs | Non-linear function of average activity values of input bits. | Complexity unjustified. No input glitches. No inter-component routing power.
Chapter 3
Implementation details
Section 3.1 of this chapter summarises the tools used to implement the word-length
optimisation technique described in this thesis, as well as the tools used to extract power
consumption measurements for the purposes of power model characterisation and
evaluation. Section 3.2 then describes simple models for estimating the area of the
arithmetic components of a DSP algorithm implemented on an FPGA.
3.1 Hardware and software used
Work presented in the following chapters contains a number of values for simulation or
algorithm run times that were all obtained on a desktop computer containing an Intel
Pentium 4 processor running at 3 GHz with 1.5 GBytes of RAM. Additionally, the software
packages summarised below were used to describe and create FPGA implementations of
the DSP algorithms targeted in this work, and to implement the various power estimation
and optimisation techniques described in this thesis.
3.1.1 Matlab and Simulink 2006a
Matlab [Mat6a] is a computing environment and programming language in which the
power models, optimisation techniques and supporting methods described in this thesis are
implemented. The environment contains the graphical block diagram tool Simulink which
allows the modelling, simulation and analysis of block diagram descriptions of systems
containing blocks from a variety of libraries, one of which is the System Generator package
provided by Xilinx described in the following section.
3.1.2 System Generator 9.1
The System Generator block library for Simulink [Xil08b] provides blocks that are typi-
cally used for the implementation of DSP and Communications systems, from arithmetic
components such as adders and multipliers to more complex systems such as LTI filters,
Discrete Fourier Transforms and Viterbi encoders and decoders. Blocks for each of these
operations or systems can be placed within a Simulink diagram and connected in an arbi-
trary fashion via signals that can contain delay registers if necessary.
As well as the components and connections of an algorithm, the System Generator
tool allows the specification of the rounding schemes and the word-lengths of the signals
within that algorithm. A block diagram described in this way can be simulated, allowing
the bit-accurate signals between components to be gathered for the rounding scheme and
word-lengths used.
System Generator can generate VHDL from a block diagram which can then be
synthesised, mapped, placed and routed onto an FPGA using the software packages de-
scribed in the following subsections. Thus System Generator diagrams form the design
entry stage (see Section 2.2.2) of the DSP algorithms targeted in this work.
The work presented in this thesis is tightly integrated with the System Generator
tool, analysing systems described and simulated by it and then using the power models and
word-length optimisation techniques described in subsequent chapters to select optimal
word-lengths for the signals in the System Generator block diagram to minimise system
dynamic power consumption.
3.1.3 Synplicity FPGA synthesis tools
The Synplicity FPGA synthesis tool Synplify Pro (version 8.8) [Syn06] is the logic synthesis
tool used to convert a VHDL description of a DSP algorithm designed in System Generator
into a netlist of logic gates that must then be mapped, placed and routed onto an FPGA
by the Xilinx ISE tools that are summarised in the following section.
3.1.4 Xilinx ISE 9.1
The Xilinx ISE tools are used in this work to perform the mapping, placement and routing
steps summarised in Section 2.2.2 after logic synthesis. As well as providing a bitstream
that can be programmed onto a Xilinx FPGA these tools provide information on the
system clock speed and the logic area and routing resources consumed by the placed and
routed system. As described in the background to this thesis, once an algorithm has
been placed and routed onto an FPGA its power consumption can be estimated using
switch-level simulation to gather activity values, in this case by using Mentor Graphics
ModelSim SE 6.2c [Gra06], and then combining these with capacitance values by using
Xilinx XPower [Xil07].
3.2 Construction of arithmetic components
In order to begin to understand and then estimate the power consumption of arithmetic
components in FPGAs it is critical that we first study the construction of these com-
ponents. In this section typical implementations of the arithmetic components essential
to signal processing on FPGAs will be presented. Additionally simple area models for
these components are detailed that will be used during a ‘rough placement’ technique de-
scribed in Chapter 5 for estimating the power consumed in the routing wires that connect
arithmetic components, and for word-length optimisation for area minimisation which is
compared to word-length optimisation for power consumption minimisation in Chapter 6.
Figure 3.1: (a) A full adder as implemented on a Xilinx FPGA. As shown, the XOR
of the inputs A and B is computed in a LUT, whilst the remainder of the necessary
logic is performed in the carry logic provided. (b) A chain of full adders that form a
ripple-carry adder.
3.2.1 Addition
For two’s complement fixed point inputs, addition in FPGA devices is typically performed
using a ripple-carry implementation where pairs of input bits are summed with a carry-in
signal in order to create a sum bit and a carry-out signal that is delivered to the sum of
the next pair of input bits.
Figure 3.1(a) depicts this addition of two input bits and a carry-in signal, as im-
plemented on Xilinx FPGAs which have been used as the target devices for experimental
work in this thesis.
The diagram in Figure 3.1(b) shows how multiple full adders are arranged to form
a carry chain. The sum outputs of a ripple carry adder can be optionally registered using
the registers within each SLICE that an adder occupies, at no extra cost in area. Referring
back to Figure 2.3(b), we can see that each SLICE in a Xilinx FPGA can contain two full
adders whose carry chain is connected, and hence each CLB can contain two pairs of two
full adders whose carry chain is connected.
Ripple-carry adders implemented in the FPGA fabric in this way form an essential
part of multiplication when embedded multipliers are not used, as will be seen when the
construction of multipliers is examined in Section 3.2.2.
Figure 3.2: The area in SLICEs occupied by an adder on a Xilinx Virtex II FPGA for
input word-lengths between 2 and 32 bits, plotted as area (SLICEs) against adder input
word-length.
Area consumed by addition
Figure 3.2 presents data showing the area in SLICEs of the ripple-carry adders imple-
mented on the Xilinx Virtex II FPGA. We can see that the area of adders increases by
one SLICE for every two bits of adder input word-length, as two full-adders can be im-
plemented in each slice. The area of an adder can be estimated easily by using a linear
approximation to the data shown in Figure 3.2.
An important point to note is that when two signals are added, if one of the signals
has more LSB bits than the other then no logic is required to add these bits. Logic is only
needed for the bits that ‘overlap’ (though sign-extension must be used to ensure the MSB
bits overlap in two’s complement), as shown in Figure 3.3. This means that of an adder’s
two input signals, the one with more LSB bits has no effect on the adder’s area, unless
the word-length of that signal becomes shorter than the word-length of the other signal.
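The overlap rule can be sketched as follows. Describing each signal by a count of integer bits (including the sign bit) and a count of fractional bits is an assumption made for this example, as is rounding the SLICE count up from the two-full-adders-per-SLICE observation instead of using a fitted linear model. Any extra bit needed for word growth at the adder output is ignored.

```python
import math

def full_adders_needed(int_a, frac_a, int_b, frac_b):
    """Number of full adders for A + B under the overlap rule.

    int_*  : bits left of the binary point, including the sign bit (assumed convention)
    frac_* : bits right of the binary point
    """
    # MSB side: the shorter input is sign-extended up to the longer one.
    # LSB side: only the fractional bits present in both inputs need logic.
    return max(int_a, int_b) + min(frac_a, frac_b)

def adder_area_slices(int_a, frac_a, int_b, frac_b):
    """Rough area estimate: two full adders fit in each SLICE."""
    return math.ceil(full_adders_needed(int_a, frac_a, int_b, frac_b) / 2)

# Signal B has more LSBs than A, so lengthening B's fraction costs nothing...
base = adder_area_slices(4, 6, 6, 8)
longer_b = adder_area_slices(4, 6, 6, 12)
# ...until B's fraction becomes shorter than A's, at which point B sets the width.
shorter_b = adder_area_slices(4, 6, 6, 4)
```

In this example `base` and `longer_b` are equal, while `shorter_b` is smaller, which matches the behaviour described in the paragraph above.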
3.2.2 Multiplication
Multiplication can be implemented either by using the logic blocks available on FPGAs,
or by using embedded multiplier blocks that are present in more modern devices. This
section will detail how this is done in both cases.
Figure 3.3: The aligned input word-lengths of input signals A and B of an adder.
The number of full adders required is adderwl, whilst the word-lengths and scalings
of signals A and B are nA, nB and pA, pB respectively. The triangle represents the
binary point and S represents the sign bit of each signal, which in the case of signal
A must be sign extended (shown by grey boxes). As signal A has fewer LSB bits it
alone determines the number of full-adders needed to add the two signals, whilst the
extra two LSB bits of B are available ‘for free’.
Multipliers implemented using standard logic
Figure 3.4 shows how fully parallel multipliers are efficiently constructed on Xilinx FPGAs.
By using extra gates available outside each LUT as well as the LUT itself, one carry chain
of SLICEs can be used to form the partial products of two bits of the multiplicand signal
multiplied by the multiplier signal, before summing these two partial products using the
carry chain. All sums of partial products are then added together to form the product of
the multiplier and multiplicand using a tree of adders, each implemented using a carry
chain of SLICEs.
A multiplication with an m-bit wide multiplier and n-bit wide multiplicand requires
⌈n/2⌉ carry chains that are m bits long to form the sum of pairs of partial products and
⌈n/2⌉ − 1 carry chains that are m bits long to form the tree of adders.
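As a small illustration, the carry chain counts stated above can be computed directly. The function name is hypothetical, and the sketch returns only the number of chains; it does not attempt to reproduce measured SLICE counts.

```python
import math

def lut_multiplier_carry_chains(m, n):
    """Carry chains for an m-bit multiplier and n-bit multiplicand.

    Returns (partial_product_chains, adder_tree_chains); each chain is
    taken to be m bits long, as stated in the text.
    """
    pp_chains = math.ceil(n / 2)   # each chain forms and sums two partial products
    tree_chains = pp_chains - 1    # adder tree that combines the pair sums
    return pp_chains, tree_chains

pp_18, tree_18 = lut_multiplier_carry_chains(18, 18)
pp_odd, tree_odd = lut_multiplier_carry_chains(8, 5)
```

For an odd multiplicand word-length the ceiling means that one chain handles a single partial product.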
Embedded multipliers
Xilinx FPGAs since the Virtex II have incorporated blocks in the FPGA fabric that are
dedicated to two’s complement multiplication. This is because despite the efficient partial
product generation that can be achieved by using the extra gates and carry logic outside
of the LUTs in an FPGA, multiplication still requires a large number of SLICEs (186 are
required for an 18 × 18-bit multiplication).
Figure 3.4: A parallel multiplier as implemented on a Xilinx FPGA. The PP blocks
are partial product generators that compute the result of multiplying the multiplier
signal x with each bit of the multiplicand signal y. In one chain of SLICEs, two partial
products can be generated using the LUTs and then added using the SLICEs’ carry
chain. Each carry chain of SLICEs is depicted in the diagram using a grey box.
Embedded multipliers in Xilinx FPGAs consist of dedicated circuitry that computes
the result of an 18× 18-bit two’s complement multiplication. The internal construction of
these blocks has not been revealed by the manufacturer, but presumably modern Booth-
recoded multipliers are used for their speed, area and power efficiency. Many important
multiplication structures of this kind are described in [Hua03]. The exact size of an
embedded multiplier compared to that of a CLB or SLICE is not known in the academic
community as manufacturers have chosen not to reveal such details, however in previous
work [Smi06] the estimate (3.1) has been used at the suggestion of personnel at Xilinx.
This estimate will also be used in this work where necessary.
embedded mult block area ≈ 13× SLICE area (3.1)
For multiplications larger than 18 × 18-bits, several embedded multipliers can be
used in a divide-and-conquer approach to generate products of sections up to 18× 18-bits
wide of the multiplication, which must then be summed to generate the product [BT02].
This requires some addition to be implemented in the FPGA fabric as well as use of
⌈m/18⌉ × ⌈n/18⌉ embedded multipliers to perform the multiplication.
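The block count and the area estimate (3.1) can be combined in a short sketch. The constant and function names are assumptions for illustration, and the area returned covers the embedded blocks only; the fabric adders needed to sum the partial results are not modelled here.

```python
import math

EMBEDDED_MULT_WIDTH = 18        # input width of one embedded multiplier block
SLICES_PER_EMBEDDED_MULT = 13   # area ratio from estimate (3.1)

def embedded_multipliers_needed(m, n):
    """Number of 18 x 18-bit blocks for an m x n-bit multiplication."""
    return math.ceil(m / EMBEDDED_MULT_WIDTH) * math.ceil(n / EMBEDDED_MULT_WIDTH)

def embedded_block_area_slices(m, n):
    """SLICE-equivalent area of the embedded blocks alone (no fabric adders)."""
    return SLICES_PER_EMBEDDED_MULT * embedded_multipliers_needed(m, n)

blocks_18 = embedded_multipliers_needed(18, 18)
blocks_19 = embedded_multipliers_needed(19, 18)
area_32 = embedded_block_area_slices(32, 32)
```

The jump in block count when a word-length passes a multiple of 18 is the source of the step-like behaviour of embedded multiplier area.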
Figure 3.5: For input word-lengths between 2 and 32 bits, plots of area (SLICEs)
against multiplier word-length and multiplicand word-length showing: (a)
the area in SLICEs of parallel multipliers implemented in LUTs, and, (b) the area
in SLICEs of embedded multipliers, counting both embedded mult blocks and any
additional LUTs required, using the embedded multiplier to SLICE ratio in (3.1).
Area consumed by multiplication
Figure 3.5 presents data showing the area in SLICEs of the Xilinx Virtex II implementa-
tions of multipliers implemented in LUTs and in embedded multipliers.
The area of multipliers implemented in LUTs can be approximated by using an
equation of the form shown in (3.2), i.e. linear in the input word-lengths wlx and wly,
with one cross term, and characterisation coefficients as indicated by α, β, γ and δ.
LUT mult area = αwlx + βwly + γwlxwly + δ (3.2)
In the case of embedded multiplication however we can see that the relationship
between area and word-length is more difficult to describe due to the non-linearities that
occur when several embedded multipliers must be used to accommodate increases in word-
length. When work on word-length optimisation for area is described later in Chapter 6
we will see that the surface shown in Figure 3.5(b) will have to be stored as a look-up
table in order to estimate embedded multiplier areas.
Chapter 4
Models for power consumption in
arithmetic components
This chapter describes a set of models that allow fast estimation of the dynamic power
consumption of the addition and multiplication operations that are the building blocks
of DSP circuits. Typical ASIC and FPGA implementations of these operations require
complex combinational logic in which glitches can rapidly build up and cause high levels
of power consumption.
Section 2.3.2 highlighted that the most significant problems that must be faced
when estimating the power consumption of a CMOS circuit are the data dependence
of activity values and the complexity of obtaining capacitance values. Additionally as
summarised in Section 2.1.1 the complexity of word-length optimisation problems and the
size of the design space place very tight constraints on the computational effort that is
necessary to estimate the power consumed in a circuit.
Power consumption macro-modelling has been highlighted as a technique whose
high-level of abstraction allows for extremely low levels of computational complexity. Ex-
isting macro-models for arithmetic components in FPGAs have a number of shortcomings
however that were described in Section 2.3.2: i) the models are entirely empirical in na-
ture and lack justification for the high order polynomials used that have large numbers
of terms, ii) average switching activity values across word-level inputs are used which are
inaccurate for signals typically used within DSP systems that exhibit different regions of
bit-level activity, iii) input signals containing glitches are not modelled, and iv) the power
consumed in the routing wires between components is not modelled.
The aim of this chapter is to introduce a set of macro-models that address problems
i) and ii) above: i.e. models that use a small number of terms whose form is due to known
phenomena that occur within the arithmetic components, and that use a signal activity
model that accounts for the different regions in the bit-level representation of a signal.
Thus the models presented here are developed from a sound theoretical basis and enhanced
by practical measurements in order to ensure their validity.
Of the remaining problems with existing macro-models that are not covered in this
chapter, iv) is dealt with in Chapter 5 where a method for estimating the power consumed
in the routing wires between components is presented. Unfortunately no method has
been developed for estimating the power consumed in components whose inputs contain
glitches, though insight into the complexity of doing so is given prior to the conclusion of
this chapter.
The main contributions and associated publications of the work described in this
chapter are:
• Models that are able to estimate the activity propagated through adders and multi-
pliers implemented in the logic fabric of FPGAs (published in [CCC07]),
• Use of these models, along with device-level measurements and switch-level simula-
tion of components, in order to determine the relationship between input word-length
and activity and the power consumption of adders and multipliers, and,
• A set of macro-models that use a minimal set of high-level parameters to estimate
power consumption, based on the above analysis (published in [CGC05; CGCC06]).
In the remainder of this chapter, models for estimation of activity within adders are
presented first and then used to build macro-models of their power consumption. Activity
models and macro-models for multipliers are described afterwards.
4.1 Estimating the activity within adders
The critical difference between the dynamic power consumption and the area occupied by
digital logic is that, unlike area, dynamic power consumption is affected by the activity in
the signals passing through a circuit.
In order to estimate the power consumption in addition, it is necessary to study
how switching activity at the inputs to an adder causes further activity to propagate
through the carry chain and other logic elements in an adder before reaching the adder
outputs. This section is dedicated to developing models for this purpose. Once this has
been established it will be possible to build models that abstract away from the details
of activity propagation through individual logic blocks in an adder and instead use a
high-level model for power estimation that is described in Section 4.2.
4.1.1 Transition density model for Xilinx Adders
The transition density method from [Naj93] for propagating activity estimates through
a circuit was summarised in Section 2.3.2 and has been used in a variety of situations,
including in FPGA applications [PYW02], to provide activity estimates within a circuit.
Although it allows activity to be estimated without performing a time-consuming switch-
level simulation of a circuit, the process of propagating transition density estimates through
a circuit can still be computationally expensive for large circuits. Additionally, because
delays within a block of logic are not known to the transition density model, activity
estimates made can have potentially large errors due to the inability to predict the effects
of these internal delays.
To avoid the complexity of propagating transition density estimates through a
circuit, we have applied the method to Xilinx adders directly as the first step towards
developing an analytical equation for adder output activity. We have then characterised
this model using device level measurements in order to account to some extent for the
internal logic delays that the transition density model alone cannot account for.
Figure 4.1: A full adder as implemented on a Xilinx FPGA. Signal Ln is the XOR of
the inputs An and Bn, computed in a LUT, whilst Cn−1, Cn and Sn are the carry-in,
carry-out and sum signals of this full adder, respectively.
In Section 3.2.1 the construction of a 1-bit adder on Xilinx FPGAs was explained
and depicted in Figure 3.1(a), reprinted here in Figure 4.1 for clarity. As was described,
modern FPGAs contain additional circuitry to allow fast-carry chains to be formed, as well
as the Look-Up Tables (LUTs) used to implement general logic functions. When applying
the transition density model to the adder circuit in Figure 4.1, we treat the LUT, carry-
chain MUX and sum XOR gate separately, resulting in a model where transitions in the
input Bn and signal Ln (the output of the LUT) are uncorrelated upon arrival at the
carry-chain MUX, due to the different delays along each wire for those two signals, in line
with [Naj93].
Equations (4.1) and (4.2) show the transition density equations for the carry-out
(Cn) and sum (Sn) outputs of full adder n in a carry chain, in terms of the adder inputs
An and Bn and the carry-in signal Cn−1. As expected the activity in each 1-bit adder is
dependent on the activity propagated to it from the previous stage in the carry chain, as
well as that propagated from its inputs.
\[ D(C_n) = P(A_n \oplus C_{n-1})\,[D(A_n) + D(B_n)] + P(\overline{A_n \oplus B_n})\,D(A_n) + P(A_n \oplus B_n)\,D(C_{n-1}) \tag{4.1} \]
\[ D(S_n) = D(A_n) + D(B_n) + D(C_{n-1}) \tag{4.2} \]
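A minimal sketch of how (4.1) and (4.2) might be evaluated stage by stage along a carry chain is given below. The function names and the list-based interface are assumptions for illustration. The correlation probabilities are taken as inputs because computing them is a separate problem, and the second term of (4.1) is read as being weighted by the probability that An and Bn are equal, i.e. 1 − P(An ⊕ Bn).

```python
def carry_out_density(d_a, d_b, d_cin, p_a_xor_cin, p_a_xor_b):
    """Transition density of the carry-out of one full adder, following (4.1)."""
    return (p_a_xor_cin * (d_a + d_b)      # activity arriving via the LUT output
            + (1.0 - p_a_xor_b) * d_a      # MUX passes the input when A and B are equal
            + p_a_xor_b * d_cin)           # MUX passes the carry-in when A and B differ

def sum_density(d_a, d_b, d_cin):
    """Transition density of the sum output of one full adder, following (4.2)."""
    return d_a + d_b + d_cin

def propagate(d_a, d_b, p_a_xor_cin, p_a_xor_b, d_c0=0.0):
    """Walk an N-stage ripple-carry adder from LSB to MSB in a single pass.

    All list arguments hold one value per full adder, LSB first.
    d_c0 is the transition density of the carry-in to the first stage.
    Returns (sum densities, carry-out densities).
    """
    d_sum, d_carry = [], []
    d_c = d_c0
    for n in range(len(d_a)):
        d_sum.append(sum_density(d_a[n], d_b[n], d_c))
        d_c = carry_out_density(d_a[n], d_b[n], d_c, p_a_xor_cin[n], p_a_xor_b[n])
        d_carry.append(d_c)
    return d_sum, d_carry

# Three stages with every density and probability set to 0.5.
half = [0.5] * 3
d_sum, d_carry = propagate(half, half, half, half)
```

The single loop over the stages makes the per-adder cost, and the amount of bit-level information that must be supplied, explicit.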
Estimating activity in an adder circuit using this model would require separate
calculations to be made for the carry-out and sum signals of each bit of an adder, making
this method less attractive for word-length optimisation due to its computational cost
which is Θ(N), where N is the number of full adders in a ripple-carry adder. Also note
that the calculation of P (An ⊕ Cn−1) becomes complex to evaluate when there is some
correlation between An and Cn−1 (i.e. due to correlation between the An input in this
stage and those A and B inputs in earlier stages in the carry chain).
In summary, the activity throughout a ripple-carry adder formed of N full adders
implemented on a Xilinx FPGA can be estimated using (4.1), (4.2), and the following
information:
• the transition densities of the inputs: D(An) and D(Bn), 1 ≤ n ≤ N ,
• the input correlation probabilities: P (An ⊕Bn) , 1 ≤ n ≤ N ,
• the input correlation probabilities: P (An ⊕ Cn−1) , 1 ≤ n ≤ N .
This is a relatively large amount of information to gather and store for each full
adder in a circuit. Gathering bit-level activities for a circuit poses a particular problem
as switch-level simulation of a circuit is computationally expensive and hence undesirable
when fast models for optimisation are required. A power consumption macro-model built
using bit-level parameters would be overly complex and would require high levels of
characterisation. Instead we would like to understand how the activity within adders relates
to high-level signal information that can be used to make simpler macro-models.
As will be seen in the following section, the model described and the information
required can be greatly simplified by making some assumptions about the profile of the
activity of the (word-level) input signals to each adder.
4.1.2 Incorporation of signal activity profile information
The bit-level activity profile of signals in DSP systems has been studied extensively in
previous academic work, as was described in the DBT work in Section 2.3.2. By using the
results of this work it becomes possible to reduce the complexity of the signal correlation
calculations described in the previous section and to generalise the transition density model
for use across sections of an adder that exhibit different activity behaviour.
In the DBT model [LR96] the authors found that the activity of typical two’s
complement signals in DSP systems could be approximated by the activity of Gaussian
signals with zero mean, a particular standard deviation σ, and lag-1 autocorrelation ρ.
The authors showed that the bit-level activity of this set of Gaussian signals could be
broken up into three regions where the bits within each region exhibit similar behaviour.
The three regions are, in order from LSB to MSB: 1) spatially and temporally uncorrelated
LSB bits, 2) a region with increasing spatial correlation towards the MSB, and, 3) the
sign bit(s) at the MSB end of the signal.
The breakpoints BP0 and BP1, between regions 1 and 2 and regions 2 and 3,
respectively, and the activity of the bits in region 3, denoted Tmsb, can be estimated using
only the parameters σ and ρ that define the signal, as shown by (4.3)-(4.5).
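\[ BP_0 = \log_2 \sigma + \log_2\left(\sqrt{1-\rho^2} + \frac{|\rho|}{8}\right) \tag{4.3} \]
\[ BP_1 = \log_2\left(|\mu| + 3\sigma\right) \tag{4.4} \]
\[ T_{msb} = \frac{1}{\pi}\cos^{-1}(\rho) \tag{4.5} \]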
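Equations (4.3)-(4.5) are straightforward to evaluate, as the following sketch shows. The function names are hypothetical. The helper that assigns a bit to a region assumes that bit positions are counted from 0 at the LSB of an integer-scaled signal and that a bit lying exactly on a breakpoint belongs to Region 2; both conventions are assumptions made here and are not stated in the text.

```python
import math

def dbt_parameters(sigma, rho, mu=0.0):
    """Breakpoints and sign-bit activity of the DBT model, from (4.3)-(4.5)."""
    bp0 = math.log2(sigma) + math.log2(math.sqrt(1.0 - rho ** 2) + abs(rho) / 8.0)
    bp1 = math.log2(abs(mu) + 3.0 * sigma)
    t_msb = math.acos(rho) / math.pi
    return bp0, bp1, t_msb

def region_of_bit(bit, bp0, bp1):
    """Region (1, 2 or 3) of a bit position under the assumed indexing convention."""
    if bit < bp0:
        return 1   # uncorrelated LSB bits
    if bit > bp1:
        return 3   # sign bits
    return 2       # intermediate region

# Zero-mean, temporally uncorrelated signal with a standard deviation of 64.
bp0, bp1, t_msb = dbt_parameters(sigma=64.0, rho=0.0)
regions = [region_of_bit(b, bp0, bp1) for b in (3, 7, 10)]
```

Only the word-level statistics σ, ρ and µ are needed, which is what makes this description attractive compared with gathering per-bit activities.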
Note that the bit-level activities and correlations of a signal are completely defined
by the DBT model. By assuming that the input signals to an adder have profiles as defined
by the DBT model, the bit-level information required to evaluate the transition density
model for adders described by (4.1) and (4.2) is known implicitly due to the activity
profiles of the regions in the DBT model.
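As a rough illustration of (4.3)-(4.5), the following minimal sketch evaluates the two breakpoints and the sign-bit activity from the word-level parameters. The function name `dbt_parameters` and the example values are assumptions made for illustration only; they do not come from the DBT work itself.

```python
import math

def dbt_parameters(sigma: float, rho: float, mu: float = 0.0):
    """Return (BP0, BP1, T_msb) for a Gaussian signal, following (4.3)-(4.5).

    sigma: standard deviation, rho: lag-1 autocorrelation (|rho| <= 1),
    mu: mean (zero for the signals considered in the text).
    """
    bp0 = math.log2(sigma) + math.log2(math.sqrt(1.0 - rho ** 2) + abs(rho) / 8.0)
    bp1 = math.log2(abs(mu) + 3.0 * sigma)
    t_msb = math.acos(rho) / math.pi
    return bp0, bp1, t_msb

# Hypothetical example: a temporally uncorrelated signal with sigma = 1024.
bp0, bp1, t_msb = dbt_parameters(sigma=1024.0, rho=0.0)
print(bp0, bp1, t_msb)
```

For an uncorrelated signal (ρ = 0) the second logarithm in (4.3) vanishes, so BP0 is simply log2 σ and the sign bits toggle on half of all clock cycles.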
As a result we can establish the activity in each region of a ripple-carry adder
without resorting to measuring bit-level activities. A general activity model can be created
for each region of the adder, as follows.
For those bits of the adder where both inputs A and B are in Region 1, we know
that all input bits are uncorrelated and have an activity of 0.5, thus An and Cn−1 are
uncorrelated and P (An) = D(An) = 0.5 so:
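\[ \begin{aligned}
P(A_n \oplus C_{n-1}) &= P(\bar{A}_n C_{n-1}) + P(A_n \bar{C}_{n-1}) \\
&= P(\bar{A}_n)P(C_{n-1}) + P(A_n)P(\bar{C}_{n-1}) \\
&= 0.5\,[P(C_{n-1}) + P(\bar{C}_{n-1})] = 0.5
\end{aligned} \tag{4.6} \]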
Similarly An and Bn are uncorrelated and P (Bn) = D(Bn) = 0.5, so P (An ⊕ Bn) = 0.5.
As a result the carry-out and sum transition density equations simplify to:
D(Cn) = 0.75 + 0.5D(Cn−1) (4.7)
D(Sn) = D(An) +D(Bn) +D(Cn−1) = 1 +D(Cn−1) (4.8)
For those bits of the adder where both inputs A and B are in Region 3, we know
that all the A inputs are identical (i.e. fully correlated) and have activity TAmsb , and all
B inputs are identical and have activity TBmsb . Both TAmsb and TBmsb are calculated from
the lag-1 autocorrelation of signals A and B, i.e. ρA and ρB, respectively, as shown in
[BHS98]. If the word-level cross-correlation between the signals A and B is ρab, then the
probability that these MSB (i.e. sign) bits of A and B are of different value during a clock
cycle, P (Amsb⊕Bmsb), can be calculated in the same way as the activity of the sign bits of
a Gaussian signal with lag-1 autocorrelation ρ, as shown in [BHS98]. So P (Amsb ⊕Bmsb)
is given by:
\[ P(A_{msb} \oplus B_{msb}) = \frac{1}{\pi}\cos^{-1}(\rho_{ab}) \tag{4.9} \]
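The relationship in (4.9) can be checked numerically. The sketch below is an illustrative assumption and not part of the original characterisation: it draws pairs of zero-mean, unit-variance Gaussian samples with cross-correlation ρab and counts how often their signs differ. The function names, sample count and seed are all chosen here for illustration.

```python
import math
import random

def sign_mismatch_probability(rho_ab: float) -> float:
    """Probability that the sign bits of A and B differ, as given by (4.9)."""
    return math.acos(rho_ab) / math.pi

def simulate_sign_mismatch(rho_ab: float, samples: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same probability for correlated Gaussian pairs."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(samples):
        a = rng.gauss(0.0, 1.0)
        # b has unit variance and correlation rho_ab with a.
        b = rho_ab * a + math.sqrt(1.0 - rho_ab ** 2) * rng.gauss(0.0, 1.0)
        if (a < 0.0) != (b < 0.0):
            mismatches += 1
    return mismatches / samples

for rho in (-0.5, 0.0, 0.5):
    print(rho, sign_mismatch_probability(rho), simulate_sign_mismatch(rho))
```

The simulated fraction should agree with the closed-form value to within sampling noise, which is of the order of a few tenths of a percent for 20000 samples.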
For simplicity we assume there is no correlation between the inputs to this region
and the inputs to earlier regions, and that as a result the A and B inputs in Region 3 are
uncorrelated with the carry-in signal C1 into the first full-adder in this region. We also
assume that P (C1) = 0.5 as P (Cn) increases to this value in the part of the adder where
both A and B are in region 1. Hence for each 1-bit adder we know that:
Cn−1 = An−1Bn−1 + (An−1 ⊕Bn−1)Cn−2 (4.10)
where An−1, Bn−1 and Cn−2 represent the adder inputs and carry-in input from the
previous adder stage in the carry-chain. However as both the A and B word-level inputs
are in Region 3 in this part of the adder we know An−1 = An and Bn−1 = Bn, so:
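\[ \begin{aligned}
P(\bar{A}_n C_{n-1}) &= P\left(\bar{A}_n \left(A_{n-1}B_{n-1} + (A_{n-1} \oplus B_{n-1})C_{n-2}\right)\right) \\
&= P(\bar{A}_n B_n C_{n-2})
\end{aligned} \tag{4.11} \]
where $C_{n-2}$ can be replaced recursively using (4.10) until we get $P(\bar{A}_n C_{n-1}) =
P(\bar{A}_n B_n C_1)$, and similarly we obtain $P(A_n \bar{C}_{n-1}) = P(A_n \bar{B}_n \bar{C}_1)$. So in the section of
the adder where both the A and B inputs are in Region 3 of the DBT model:
\[ \begin{aligned}
P(A_n \oplus C_{n-1}) &= P(\bar{A}_n C_{n-1}) + P(A_n \bar{C}_{n-1}) \\
&= P(\bar{A}_n B_n C_1) + P(A_n \bar{B}_n \bar{C}_1) \\
&= 0.5\,[P(\bar{A}_n B_n) + P(A_n \bar{B}_n)] \\
&= 0.5\,P(A_n \oplus B_n) = 0.5\,P(A_{msb} \oplus B_{msb})
\end{aligned} \tag{4.12} \]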
So for this section of the adder:
D(Cn) = TAmsb (1− 0.5P (Amsb ⊕Bmsb))
+ P (Amsb ⊕Bmsb)(0.5TBmsb +D(Cn−1))
(4.13)
D(Sn) = TAmsb + TBmsb +D(Cn−1) (4.14)
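The Boolean simplification behind (4.11) and (4.12) can be confirmed exhaustively. The sketch below enumerates all input combinations of one full-adder stage in which the A and B bits equal those of the following stage, as is the case in Region 3. The helper names are assumptions introduced here for illustration.

```python
from itertools import product

def carry(a: int, b: int, c_in: int) -> int:
    """Full-adder carry-out, as in (4.10)."""
    return (a & b) | ((a ^ b) & c_in)

def check_region3_identities() -> bool:
    """Check NOT(A).C_out == NOT(A).B.C_in and A.NOT(C_out) == A.NOT(B).NOT(C_in)
    when the same A and B bits feed the stage that produces C_out."""
    for a, b, c_in in product((0, 1), repeat=3):
        c_out = carry(a, b, c_in)
        not_a, not_b = 1 - a, 1 - b
        if (not_a & c_out) != (not_a & b & c_in):
            return False
        if (a & (1 - c_out)) != (a & not_b & (1 - c_in)):
            return False
    return True

print(check_region3_identities())
```

Because each identity removes one stage of the carry chain, applying it repeatedly moves the carry term back to the carry-in of the first Region 3 stage, which is the recursion described in the text.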
Both (4.7) and (4.13) are recurrence relations, and by solving these we can obtain
closed-form expressions for the transition density at the carry-out of the nth bit in the
section of the adder where both inputs are in Region 1, and for the carry-out in the nth
bit in the section where both inputs are in Region 3. These closed-form expressions are
defined as follows.
Given the following word-level parameters of the input signals A and B to the adder:
their lag-1 autocorrelations ρa, ρb, standard deviations σa, σb, and lag-0 cross-correlation
ρab, the breakpoints between regions 1 and 2 (ABP0) and regions 2 and 3 (ABP1) in input
signal A are defined by [LR96]:
\[ A_{BP_0} = \log_2 \sigma_a + \log_2\left(\sqrt{1-\rho_a^2} + |\rho_a|/8\right) \tag{4.15} \]
\[ A_{BP_1} = \log_2\left(3\sigma_a\right) \tag{4.16} \]
and BBP0 and BBP1 are defined similarly for input signal B. Within the adder the upper
bound of the Region 1 model in (4.7)-(4.8) and lower bound of the Region 3 model in
(4.13)-(4.14) are then given by (4.17) and (4.18) respectively.
MINBP0 = min(ABP0 , BBP0) (4.17)
MAXBP1 = max(ABP1 , BBP1) (4.18)
The activities in the carry chain in regions 1 and 3 of the adder are calculated as follows.
For bits n = 1..MINBP0 , where both A and B are in Region 1:
\[ D(C_n) = 1.5 - 1.5\,(0.5)^n \tag{4.19} \]
For bits n = MAXBP1 ..W , where both input signals are in Region 3, and W is the total
input word-length of the adder:
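\[ D(C_n) = C_1\lambda^i + \frac{k\,(1-\lambda^i)}{1-\lambda} \tag{4.20} \]
where
\[ i = n - MAX_{BP_1} + 1, \qquad
\lambda = \frac{1}{\pi}\cos^{-1}(\rho_{ab}), \qquad
k = \frac{\cos^{-1}(\rho_a)\,(1 - 0.5\lambda) + 0.5\lambda\cos^{-1}(\rho_b)}{\pi} \]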
We assume that for Region 3, C1 = 1.5, as from (4.19) we can see that the activity in the
carry chain in Region 1 increases towards this value.
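As a consistency check, the closed forms (4.19) and (4.20) can be compared against direct iteration of the recurrences (4.7) and (4.13). The sketch below makes two assumptions that are worth stating: the Region 1 recurrence starts from zero carry activity below bit 1 (which is what (4.19) implies at n = 0), and ρab is greater than −1 so that λ is less than 1. All function names are hypothetical.

```python
import math

def carry_region1_closed(n: int) -> float:
    """Closed form (4.19)."""
    return 1.5 - 1.5 * 0.5 ** n

def carry_region1_recurrence(n: int) -> float:
    """Iterate (4.7), assuming zero carry activity into the first bit."""
    d = 0.0
    for _ in range(n):
        d = 0.75 + 0.5 * d
    return d

def region3_constants(rho_a: float, rho_b: float, rho_ab: float):
    """Return (lambda, k) as defined alongside (4.20)."""
    lam = math.acos(rho_ab) / math.pi
    t_a = math.acos(rho_a) / math.pi   # sign-bit activity of A
    t_b = math.acos(rho_b) / math.pi   # sign-bit activity of B
    k = t_a * (1.0 - 0.5 * lam) + 0.5 * lam * t_b
    return lam, k

def carry_region3_closed(i: int, rho_a: float, rho_b: float, rho_ab: float,
                         c1: float = 1.5) -> float:
    """Closed form (4.20); i counts stages from the start of Region 3."""
    lam, k = region3_constants(rho_a, rho_b, rho_ab)
    return c1 * lam ** i + k * (1.0 - lam ** i) / (1.0 - lam)

def carry_region3_recurrence(i: int, rho_a: float, rho_b: float, rho_ab: float,
                             c1: float = 1.5) -> float:
    """Iterate (4.13), which reduces to D(C_n) = k + lambda * D(C_{n-1})."""
    lam, k = region3_constants(rho_a, rho_b, rho_ab)
    d = c1
    for _ in range(i):
        d = k + lam * d
    return d

print(carry_region1_closed(8), carry_region1_recurrence(8))
print(carry_region3_closed(4, 0.5, 0.9, 0.0), carry_region3_recurrence(4, 0.5, 0.9, 0.0))
```

Both pairs of values agree to floating-point precision, which is the practical benefit of the closed forms: the activity at any bit is available without stepping through the carry chain.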
By using (4.19) with (4.8) and (4.20) with (4.14) we can also build equations for
the transition density at the sum output of the n-th bit for Regions 1 and 3 of the adder,
whilst the activities for sum output bits in between these regions can be obtained by
interpolating between the last and first sum activities for regions 1 and 3 respectively.
Figure 4.2: The output activity profile for a 32 bit adder, as estimated by the proposed
transition density-based method, before characterisation. The MSBs are on the right-hand
side of the x-axis. Input activity profiles are shown as dashed lines. [Plot: average
activity (transitions / clock cycle) against adder output bit; series: output activity
from model, input signal A, input signal B.]
A typical output activity profile estimated using our model is shown in Figure 4.2,
along with the input activity profiles used, for a 32-bit addition. We can see that the
model predicts up to almost 2.5 output transitions per clock cycle for some output bits,
due to the combined contribution of the input activities and the carry chain. This is likely
to be much higher than actually occurs in an adder on a Xilinx device if the inputs to the
adder are registered (and hence input transitions arrive at the adder near-simultaneously),
as inertial delays within the adder will merge near-simultaneous transitions on the A and
B signals, and within the carry chain. To quantify the effects of inertial delays within an
adder, device-level measurements were made, and these were then used to characterise the
model, as described in the following section.
4.1.3 Model characterisation
We have available a Virtex 2 Pro device (an XC2VP30-FF896-7) on a Xilinx University
Program board from Digilent. This board is well suited to power consumption measurements
as it allows us to measure the current drawn from only the 1.5V power supply
used by the internal FPGA fabric, and not that used by the I/O pins or the remainder of
the board.
Figure 4.3: The test circuit used to measure total adder output activity. [Block diagram
(System Generator): a 32-bit uncorrelated Gaussian test signal (Signal 1, and Signal 1
delayed by 4 clocks) and control signals (constant 0, and 32 bits switching every cycle)
are selected by multiplexers under an input selector signal, pass through enabled adder
input registers, and drive an AddSub block computing a + b, followed by an output
register, a chain of 50 inverters and the output gateway.]
To measure the activity at the output of an adder as accurately as possible, and to
separate it from other power consumed by the FPGA fabric, we used the test
circuit shown in Figure 4.3, which contains an adder driven by signals from on-chip memory.
Each output of the adder is connected to a chain of 50 inverters in order to maximise the
power consumed due to the transition activity at the output of the adder. A register
controlled with an enable signal from off-chip is placed between the input signal and the
adder in order to allow power consumption measurements to be made both when the adder
is not processing any data (so not consuming any power or making output transitions),
and when the adder is processing data (and consuming power). By subtracting the power
measured in the circuit when the adder is not operating from the power measured when
it is operating, we can isolate the power consumed by the adder and inverter chain only.
Using the Xilinx FPGA Editor program on the circuit after place and route we
disconnected the inverter chain from the output pins and then removed these. Several
modified versions of the resulting circuit were then made that enabled us to isolate the
power consumed due to activity in different parts of the adder and so measure:
• the total output activity for the adder, and,
• the total output activity for the adder when its carry chain is removed, i.e. for the
XOR of the input signals.
By comparing these two activity measurements to the activity predicted by our
transition density model when using the test signals provided we were able to quantify the
effect of the inertial delay of the adder logic that can cause a lower number of glitches than
predicted by the Transition Density model. As a result we have modified the model for
the activity at the sum output of the adder in (4.2) as follows, by adding the coefficients
α and β.
D(Sn) = α[D(A) +D(B)] + βD(Cn−1) (4.21)
The coefficient α accounts for the proportion of transitions that do not propagate
through the XOR gate implemented in the LUT of the adder, due to the inertial delay of
the logic, whilst the coefficient β accounts for the proportion of transitions that do not
propagate through the carry-chain due to its inertial delay. The two activity values taken
from the board-level measurements are used to calculate α and β.
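The text does not spell out how the two measured activity values are turned into α and β. One plausible reading, offered here purely as an assumption, is to sum (4.21) over all output bits: the measurement with the carry chain removed then fixes α, and the remaining activity in the full-adder measurement fixes β. The function and argument names below are hypothetical.

```python
def characterise_alpha_beta(measured_xor_only: float, measured_total: float,
                            predicted_input_sum: float, predicted_carry_sum: float):
    """Solve the bit-summed form of (4.21) for alpha and beta.

    measured_xor_only:   total measured output activity with the carry chain removed
    measured_total:      total measured output activity of the complete adder
    predicted_input_sum: sum over bits of D(A) + D(B) from the uncharacterised model
    predicted_carry_sum: sum over bits of D(C_{n-1}) from the uncharacterised model
    """
    # Assumption: without the carry chain only the alpha term contributes.
    alpha = measured_xor_only / predicted_input_sum
    # Assumption: the extra activity in the full adder is attributed to the carry term.
    beta = (measured_total - alpha * predicted_input_sum) / predicted_carry_sum
    return alpha, beta

# Hypothetical numbers, chosen only to exercise the arithmetic.
alpha, beta = characterise_alpha_beta(8.0, 12.0, 16.0, 40.0)
print(alpha, beta)
```

Under this reading both coefficients lie between 0 and 1 whenever inertial delay suppresses transitions that the uncharacterised model predicts.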
Figure 4.4 shows the activity predicted at the output of a 32-bit adder by both
the un-characterised and characterised transition density models, given the input activity
profiles shown, and also shows the output activity estimated by the low-level simulation
tool XPower. The un-characterised model shows significant over-estimation compared to
XPower due to it not accounting for the inertial delay of both the input XOR gate and
the carry-chain. However the characterised model corresponds well with what is predicted
by XPower and should provide accurate bit-level estimates of adder output activity.
4.1.4 Transition density model results
To ensure our model is able to cope with a wide variety of input signals, we compared
the activity estimates made by it to those made by switch-level simulation and XPower
for a 32-bit adder driven by a range of Gaussian input signals with randomly selected
autocorrelation and standard deviation. Input signals 20000 samples long were used in
order to ensure the adder was properly exercised during simulation. The total activities
estimated by both our method and XPower are shown in Figure 4.5, and exhibit a high
correlation with Mean Relative Error of only 2.1% for the 500 pairs of input signals used.
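For reference, a mean relative error of this kind is usually computed as the average of the per-test relative differences. The sketch below assumes that conventional definition, with the XPower figures as the reference; the function name is hypothetical and the sample values are not measurement data.

```python
def mean_relative_error(estimates, references) -> float:
    """Average of |estimate - reference| / |reference| over all test cases."""
    if len(estimates) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - r) / abs(r) for e, r in zip(estimates, references))
    return total / len(references)

# Hypothetical values only: two tests, each 10% away from the reference.
print(mean_relative_error([11.0, 18.0], [10.0, 20.0]))
```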
Figure 4.4: The activity profile estimated using our model (before and after
characterisation), compared to the profile obtained through low-level simulation. The adder
input signal profiles are also shown as dashed lines. [Plot: average activity (transitions /
clock cycle) against adder output bit; series: XPower, characterised model,
uncharacterised model, input signal A, input signal B.]
Figure 4.5: The total output activity estimated by XPower compared to the total
output activity estimated by our model for a 32 bit adder, using 500 different pairs
of input signals. The dashed lines show ±10% relative error margins. [Scatter plot: total
activity estimated by model against total activity estimated by XPower.]
In terms of computation time, for a 32-bit adder our activity model takes the time
required to evaluate a closed form equation, whilst estimation using low-level simulation
and XPower for vectors 20000 samples long takes 197.7 seconds. Hence our method is
several orders of magnitude faster than XPower, with a low enough computational cost to
be used within the inner loop of high-level power consumption optimisation.
4.1.5 Discussion
The high accuracy and low computational complexity of the activity models developed here
make them attractive for use within optimisation procedures, though the model described
can only be used to estimate the activity within a ripple-carry adder in a Xilinx FPGA. By
itself this information will not allow us to estimate the power consumed in the logic that
implements a ripple-carry adder as the exact size, layout and capacitance of the transistors
that form this logic have not been released by the manufacturer.
However the model developed demonstrates that the word-level signal parameters
used in the DBT model (i.e. the standard deviation σ and lag-1 autocorrelation ρ)
completely describe the bit-level activity and correlation values necessary to estimate the power
consumption in an adder. This represents an important abstraction away from the
low-level details of power consumption estimation, whilst preserving high levels of accuracy as
shown in Figure 4.5.
This result will be used in the following section to help select appropriate parameters
for macro-models of power consumption in adders that operate at a high-level, i.e. models
that only use word-level parameters. The activity model developed in this section will also
be used to demonstrate the relationship between adder power consumption and word-
length, which is of critical importance during word-length optimisation. The resulting
models will be able to estimate power consumption in these components quickly whilst
retaining high levels of accuracy.
Later in Section 4.3 when the power consumed in multiplication is considered, the
model described in this section will be revisited in order to study how the word-level
statistics and word-lengths of the inputs to multipliers (which are built using a number
of additions) affect their power consumption, in order to allow us to select appropriate
parameters for macro-models for multiplier power consumption estimation.
4.2 Macro-model for power consumed in addition
In this section a set of word-level parameters necessary to accurately model the power
consumption in adders will be determined. The activity model developed in Section 4.1
will be used to explain the effects of varying certain word-level parameters, and data from
simulation will be used to demonstrate these effects.
The contribution of the work in this section is the determination of a minimal set
of word-level parameters that can be used to estimate adder power consumption. Using
as few parameters as possible whilst retaining a high level of accuracy is critical when
building a model, as the number of parameters used rapidly increases the complexity of a
model and the amount of characterisation data required.
Work in the previous section showed that, by using the DBT model, word-level
statistics of input signals to an adder can be used to completely encapsulate input
bit-level activity and correlation, hence allowing activity within the adder to be estimated
without measuring individual bit-level activities and correlations or performing
switch-level simulation. In this section we will extend this idea to allow the power consumption
in adders to be estimated using only word-level statistics.
As well as abstracting away from the individual signal activities within the adder, it
is necessary to abstract away from the low-level details of the adder’s construction, as this
affects its capacitance. Clearly the word-lengths of an adder’s inputs affect its structure
and hence its internal capacitances. As a result we will first study how the word-length of
the inputs to an adder will affect its construction and its power consumption, before we
begin to consider high-level abstractions for signal activities.
4.2.1 Relating power consumption to input word-length
In Section 3.2.1 the relationship between the area of an adder and its word-length was
presented in Figure 3.2, clearly showing that there is a linear relationship between the
two. As explained in Section 3.2.1, logic is only needed for the LSB-side input bits that
‘overlap’ (though sign-extension must be used to ensure the MSB bits overlap in two’s
complement), thus only the input word-length with fewer LSB bits affects the adder’s
area.
Under the assumption that the logic and routing components within an adder have
a uniform capacitance across different LUTs in an FPGA, we can deduce that the power
consumption of an adder can be estimated by multiplying the sum of activity values for
the components within an adder with appropriate capacitances for those components, as
shown in (4.22), for example:
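\[ \begin{aligned}
\text{adder power} \approx{} & C_{LUT}\,[A_{LUT}(1) + A_{LUT}(2) + \dots + A_{LUT}(n)] \\
& + C_{cMux}\,[A_{cMux}(1) + A_{cMux}(2) + \dots + A_{cMux}(n)] \\
& + \text{etc.}
\end{aligned} \tag{4.22} \]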
where CLUT , CcMux are the capacitances associated with switching LUTs and carry mul-
tiplexors, respectively, and similarly ALUT (i), AcMux(i) are the activities of the signals in
full-adder i of n that are associated with these components.
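The structure of (4.22) is a capacitance-weighted sum of per-stage activities. The following minimal sketch illustrates that structure only: the dictionary layout, the function name and the numerical values are assumptions, since the text notes elsewhere that the actual capacitances have not been released by the manufacturer.

```python
def adder_power_estimate(capacitances: dict, activities: dict) -> float:
    """Evaluate the right-hand side of (4.22).

    capacitances: component type -> capacitance per component (uniform across stages)
    activities:   component type -> list of activities, one per full-adder stage
    """
    return sum(capacitances[name] * sum(activities[name]) for name in capacitances)

# Hypothetical two-stage adder with placeholder capacitances and activities.
caps = {"LUT": 2.0, "cMux": 1.0}
acts = {"LUT": [0.5, 0.5], "cMux": [0.25, 0.75]}
print(adder_power_estimate(caps, acts))
```

Because each capacitance is taken to be uniform across stages, it factors out of its sum, which is why changes in word-length affect the estimate only through the activity terms.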
We can see that as the capacitance of components within the adders is uniform, the
effects of increases in word-length on power consumption can be estimated by accounting
for the corresponding increase in activity within the adder.
By studying Figure 4.6 we can see that increases in input signal word-length will
increase the number of bits that are in activity region 1 for these signals. This will cause a
corresponding increase in the number of full-adders whose internal activity is dictated by
this region. Only when the input word-lengths become small enough for one of the inputs
to have no region 1 bits does this situation change.
As a result we can expect the activity within adders to generally be linearly asso-
ciated with word-length, with slight deviations from this if the word-length of the shorter
word-length input becomes small enough for there to be no region 1 bits in that signal.
Figure 4.6: The activity profile at the output of an adder as estimated using our
model. The adder input signal profiles are also shown as thin dashed lines. [Plot: average
activity (transitions / clock cycle) against adder output bit, where bit 33 is the MSB;
series: characterised model, input signal A, input signal B.]
Hence to quickly estimate the power consumption in addition, Padd, a linear equation
such as the one in (4.23) can be used:
Padd = C1n+ C0 (4.23)
where C1 and C0 are coefficients that account for the activity and capacitance within the
adder, and n is the number of input bits that ‘overlap’ as was explained in Section 3.2.1.
Different values of C1 and C0 will be required according to the activity of the input
signals. As we saw earlier the bit-level activity in an input signal is well encapsulated
by the word-level statistics of the signal, and as a result the following sections will use
the adder activity model from Section 4.1 to find the minimum number of word-level
parameters necessary to allow values for C1 and C0 to be calculated that will accurately
estimate power consumption within adders.
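For a fixed set of input statistics, the two coefficients in (4.23) could be obtained from a handful of (n, Padd) pairs by an ordinary least-squares line fit. The text does not say which fitting method was used, so least squares is an assumption here, as are the function name and the sample data.

```python
def fit_linear_power_model(word_lengths, powers):
    """Least-squares fit of Padd = C1 * n + C0; returns (C1, C0)."""
    count = len(word_lengths)
    if count < 2 or count != len(powers):
        raise ValueError("need at least two matching data points")
    n_mean = sum(word_lengths) / count
    p_mean = sum(powers) / count
    covariance = sum((n - n_mean) * (p - p_mean) for n, p in zip(word_lengths, powers))
    variance = sum((n - n_mean) ** 2 for n in word_lengths)
    c1 = covariance / variance
    c0 = p_mean - c1 * n_mean
    return c1, c0

# Hypothetical, exactly linear data: Padd = 2n + 1.
ns = [4, 8, 16, 32]
ps = [2.0 * n + 1.0 for n in ns]
print(fit_linear_power_model(ns, ps))
```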
4.2.2 Selection of macro-model statistical parameters
As has been described, the DBT work demonstrated that the bit-level activity of signals in
DSP systems can be approximated by the activity of zero-mean Gaussian signals, and that
word-level statistical parameters capture sufficient information to determine the bit-level
activities in Gaussian signals. In Section 4.1, we showed that to allow the activity within
an adder to be calculated, the bit-level activities and correlations of the inputs to an adder
must be known.
For a pair of Gaussian signals A and B, the following list describes the word-level
statistics that completely define the activities and correlations between bits in the signals,
and the specific bit-level effects of these statistics:
The standard deviations σa, σb. These affect the positions of the breakpoints between
the 3 regions of activity in the DBT model.
The lag-1 autocorrelations ρa, ρb. These affect the activities in regions 2 and 3 of the
signals.
The cross-correlation between A and B ρab. This affects the correlation between the
bits in regions 2 and 3 of signal A and those in regions 2 and 3 of signal B.
The lag-1 cross correlations between A and B. These are: ρab1, the cross-
correlation of A and the previous value of B, and, ρba1, the cross-correlation of B
and the previous value of A. The parameter ρab1 affects the correlation between the
current bits in regions 2 and 3 of A and the previous bits in regions 2 and 3 of B,
and vice versa for ρba1.
This is a total of seven statistical parameters which is a large number to record
for each adder in a system, and a power consumption model built using all of these
parameters would require a very large amount of characterisation data. If m sampling
points are chosen for each parameter, and all combinations of sampling points are to be
measured for each parameter, a total of m^7 power consumption measurements would be
required. As such the remainder of this section will concentrate on studying the effects of
these parameters in an effort to remove those from the macro-model that do not have a
significant effect on accuracy when estimating power consumption.
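To give a sense of scale for the m^7 figure, the short sketch below counts the measurements needed for a full grid over seven parameters and, for comparison, over the three parameters retained in the final model of Section 4.2.3. The choice of m = 5 sampling points is an arbitrary assumption.

```python
def characterisation_points(m: int, parameters: int) -> int:
    """Number of measurements for a full grid of m sampling points per parameter."""
    return m ** parameters

m = 5  # assumed number of sampling points per parameter
print(characterisation_points(m, 7))  # all seven statistical parameters
print(characterisation_points(m, 3))  # three parameters, as in (4.27)
```

With five sampling points per parameter, the full seven-parameter grid needs 78125 measurements against 125 for three parameters.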
Relationship between signal scalings and standard deviation
The standard deviation of a signal has a clear effect on the activity within that signal
as it causes the breakpoints between regions of different behaviour to shift. The DBT
model describes how these breakpoints and signal standard deviation are linked as shown
in (4.3)-(4.5).
As described in Section 2.1.2, during range-analysis signals are assigned individual
scalings that allow them to store a range of values that will avoid overflow, which has a
high noise penalty.
The range of values that a signal occupies and the standard deviation of that signal
are intrinsically linked. In the case of Gaussian signals that are used to model the activity
of signals during word-length optimisation, 99.7% of the values taken by a Gaussian signal
lie within a ±3σ range of the mean of the signal, where σ is the standard deviation of the
signal.
As each input signal to a component has been scaled before word-length optimisa-
tion when power consumption estimates are required, we can assume that, for example, a
±3σ range is covered by the chosen scaling of each signal. As a result the breakpoint BP1
(calculated using (4.4)) that is closest to the MSB of a signal is fixed in relation to the
MSB of that signal, and the gap between MSB and BP1 is the same for each signal after
range analysis and appropriate scaling. This means that the number of bits in region 3 of
the DBT model in each signal is i) fixed, and ii) small, i.e. ceil(log2(3)) = 2 bits.
This consistent gap between MSB and BP1 across all signals in a system allows us to
ignore both the scaling and standard deviation of signals in a system when using word-level
statistics to describe the bit-level activity of signals. As a result the power consumption
macro-models that we create will be simpler, requiring less data for characterisation, at
no loss of accuracy.
Note that the choice of using a range of ±3σ is arbitrary, as assuming that input
signals have been scaled to some range of ±nσ would achieve the same result. Using larger
values for n would decrease the probability of overflow when storing a Gaussian signal and
increase the number of region 3 bits used in each signal.
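The trade-off described above can be made concrete. The sketch below computes the fraction of a Gaussian signal's values that fall within ±nσ, and the corresponding Region 3 size, taken here to be ceil(log2(n)) by generalising the ceil(log2(3)) = 2 result in the text. That generalisation and the function names are assumptions.

```python
import math

def gaussian_coverage(n_sigma: float) -> float:
    """Fraction of Gaussian samples within +/- n_sigma standard deviations of the mean."""
    return math.erf(n_sigma / math.sqrt(2.0))

def region3_bits(n_sigma: float) -> int:
    """Assumed Region 3 size when the signal is scaled to cover +/- n_sigma."""
    return math.ceil(math.log2(n_sigma))

for n in (3, 8):
    print(n, gaussian_coverage(n), region3_bits(n))
```

For n = 3 this reproduces the 99.7% coverage and two Region 3 bits quoted in the text; a wider scaling such as n = 8 leaves essentially no probability of overflow at the cost of a third Region 3 bit.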
There is one exception where the scaling and standard deviation of a signal cannot
be ignored. This occurs in the case where the two inputs to a component must be aligned
in a specific way related to their scaling, as occurs for the inputs to an adder, whose
binary points must be aligned. In this case the signal with the smaller scaling must be
sign extended to the same length as that of the signal with larger scaling. As a result
differences in the scalings of the two inputs to an adder cause their activity regions to lie
in different places along the inputs of an adder. This must be taken into account in a
power consumption macro-model for adders.
As such for adders it is assumed that the input signal with the larger scaling (and
hence larger standard deviation), e.g. signal l, is well scaled as described above so that its
scaling and standard deviation can be omitted from the macro-model. For the signal with
smaller scaling (and hence standard deviation), e.g. signal s, the parameter σr is used in
the macro-model to represent its standard deviation relative to the standard deviation of
signal l by using (4.24).
\[ \sigma_r = \frac{\sigma_s}{\sigma_l} \tag{4.24} \]
Autocorrelation and cross correlation in signals
In the previous section we noted that the scaling of signals before word-length optimisation
results in a small, fixed-size activity region 3 for each signal.
As shown in Section 4.1 the word-level cross-correlation between two signals man-
ifests itself as correlation between the bits in region 3 of the two signals, and to a lesser
extent the bits in region 2 of the two signals. Word-level lag-1 auto-correlation within a
signal causes the activity within regions 2 and 3 of the signal to change.
We know that, for the well-scaled signals used in the systems under word-length
optimisation, region 3 represents two bits. Region 2’s extent is the difference between
breakpoints BP0 (4.3) and BP1 (4.4), i.e.
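\[ \begin{aligned}
\text{Region 2 bits} &= \mathrm{ceil}\left[\log_2(3\sigma) - \log_2(\sigma) - \log_2\left(\sqrt{1-\rho^2} + \frac{|\rho|}{8}\right)\right] \\
&= \mathrm{ceil}\left[\log_2(3) - \log_2\left(\sqrt{1-\rho^2} + \frac{|\rho|}{8}\right)\right]
\end{aligned} \tag{4.25} \]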
The gap between the two breakpoints given by (4.25) is largest when the lag-1
autocorrelation ρ in a signal is ±1, at which point the gap is equal to
\[ \text{Max Region 2 bits} = \mathrm{ceil}\left[\log_2(3) - \log_2\left(\frac{1}{8}\right)\right] = \mathrm{ceil}[4.59] \tag{4.26} \]
In practice however signals with ρ = 1 are constants, whilst non-constant signals
with |ρ| = .99 give a maximum region 2 size of ceil[3.22] bits, whilst the minimum region
2 size is 2 bits.
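The dependence of the Region 2 size on ρ in (4.25) is easy to tabulate. The sketch below evaluates the expression before and after rounding up; the function names are assumptions, and the ±3σ scaling from the preceding discussion is built in through the log2(3) term.

```python
import math

def region2_gap(rho: float) -> float:
    """Unrounded gap between BP0 and BP1 from (4.25), assuming +/-3 sigma scaling."""
    return math.log2(3.0) - math.log2(math.sqrt(1.0 - rho ** 2) + abs(rho) / 8.0)

def region2_bits(rho: float) -> int:
    """Region 2 size in bits, i.e. the gap rounded up as in (4.25)."""
    return math.ceil(region2_gap(rho))

for rho in (0.0, 0.5, 1.0):
    print(rho, region2_gap(rho), region2_bits(rho))
```

The expression depends only on |ρ|, so positive and negative autocorrelations of equal magnitude give the same Region 2 size.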
Cross-correlation between the A and B inputs to a full-adder has a small effect on
the activity in the carry chain of the adder as shown by the activity models used when
both A and B are in activity region 3 (4.13). The contribution made by the carry chain to
the overall activity within an adder is relatively small, as was shown by the characterised
activity model in Figure 4.4. As a result we can expect the effect on activity of word-level
cross-correlation between the input signals to an adder to be relatively small.
In contrast the lag-1 autocorrelation in a signal has a direct and significant effect
on the activity in regions 2 and 3 of a signal, and as such we expect it to be an important
factor that should be included in the power consumption macro-model of an adder.
Figure 4.7 shows the effect on adder output activity of varying the lag-1 autocorrelation
and the cross-correlation word-level statistics of the inputs to an adder, when the
inputs are well scaled as defined above. As expected word-level cross-correlation has a
small effect on activity at the output of the adder, and we expect this to be reflected by
relatively small changes in adder power consumption. Lag-1 autocorrelation in the input
signals to an adder has a direct and significant effect on activity however, as shown.
Figure 4.7: The output activity of an adder with input signals that are appropriately
scaled and that have a range of signal correlations, as estimated by the model in
Section 4.1. The input signal lag-1 autocorrelations change from (a) to (c) as follows:
in (a) ρa, ρb = 0.99, (b) ρa, ρb = 0, and (c) ρa, ρb = −0.99. The different coloured lines show
the output activities of the adder under different values of word-level cross-correlation
ρab between the adder inputs: the red line denotes ρab = 0.99, green denotes ρab = 0,
blue denotes ρab = −0.99. [Three plots of adder output activity against bit number, where
bit 16 is the MSB.]
XPower measurements have been made to verify that similar trends occur in the
power consumption of an adder to those predicted above. Figure 4.8 shows the variations
in dynamic power consumption for a 16-bit adder caused by changing the value of the
auto-correlation coefficient ρb and the cross-correlation coefficient ρab. When ρb is varied
between −0.9 and +0.9, the maximum variation in the dynamic power consumption is
about 10%, whilst for the variation of ρab between −0.95 and +0.95, the variation is
always less than 5% and typically much smaller.
In practice it is clear that dynamic power consumption in an adder is affected to a
greater extent by the auto-correlation of its inputs than their cross-correlation. The results
suggest that it is possible to omit cross-correlation values from the power macro-model of
an adder without significantly affecting its accuracy. Lag-1 auto-correlation however must
be included due to its direct effect on activity in regions 2 and 3 of the input signals.
Lag-1 cross-correlations
The lag-1 cross correlations between A and B, ρab1 and ρba1, affect the correlation between
current values of A and previous values of B, and current values of B and previous values
of A, respectively. As with the other word-level correlation values, these cause bit-level
correlation only in regions 2 and 3 of the signals A and B.
Figure 4.8: Variation in the dynamic power consumption of a 16-bit adder obtained
through Xilinx XPower while the word-level cross-correlation ρab and lag-1
auto-correlation ρb are varied. The other signal statistics are held constant: σa = 1.0,
σb = 1.0, ρa = 0.5. [Plot: power consumption (µW/MHz) against cross-correlation
coefficient ρab; one curve for each of ρb = −0.9, −0.5, 0.0, 0.5 and 0.9.]
By studying the transition density model for addition in Section 4.1.1, we can see
that ρab1 and ρba1 have no effect on the activity within an adder, as the model indicates
that transitions that occur within the adder are due to:
• transitions in the bits of A and transitions in the bits of B, and,
• propagation of transitions through the adder, depending on only the current values
of the A and B bits and the correlation between these current values.
The correlation between current bits in A and previous bits in B (and vice versa)
has no effect on the probability of carry propagation, and no effect on activity in A or in
B. As such ρab1 and ρba1 need not be included in a macro-model for estimating the power
consumption in addition.
4.2.3 Summary
As a result of the adder activity model developed in Section 4.1 it has been possible to
determine the relationship between the word-length and power consumption of adders,
and the word-level statistical parameters that must be incorporated into a macro-model
to allow the fast and accurate estimation of their activity. The resulting model for the
power Padd consumed in addition is given by:
Padd = nC0(σr, ρa, ρb) + C1(σr, ρa, ρb) (4.27)
where σr is the standard deviation of the signal with smaller standard deviation relative
to that of the signal with larger standard deviation, ρa, ρb are the lag-1 autocorrelations
of input signals A and B, respectively, C0, C1 are characterisation coefficients that are
functions of the above statistical parameters, and are determined from training data, and
n is the number of input bits that ‘overlap’.
The small number of statistical parameters used to calculate C0 and C1 allows
the model to be completely characterised with a small number of power consumption
measurements. XPower was used to gather sufficient data to characterise the model, allowing
several hundred data points to be collected in the space of an hour. To estimate the power
consumption for any values of σr, ρa, or ρb between those points used to characterise
the model, linear interpolation is used to approximate the values of C0 and C1 from their
nearest recorded values.
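The evaluation of (4.27) with interpolated coefficients can be sketched as follows. This is a minimal illustration in Python and not the implementation used in this work: the grid values, the coefficient tables and the names `bracket`, `interpolate` and `adder_power` are hypothetical, and multilinear interpolation on a regular grid is assumed as one reasonable reading of interpolating from the nearest recorded values.

```python
from bisect import bisect_right
from itertools import product


def bracket(axis, x):
    """Locate x on a sorted axis: (lower grid index, fraction towards the next point)."""
    x = min(max(x, axis[0]), axis[-1])  # clamp to the characterised range
    i = min(bisect_right(axis, x) - 1, len(axis) - 2)
    return i, (x - axis[i]) / (axis[i + 1] - axis[i])


def interpolate(axes, table, point):
    """Multilinear interpolation of table[(i, j, k)] at point = (sigma_r, rho_a, rho_b)."""
    located = [bracket(axis, x) for axis, x in zip(axes, point)]
    value = 0.0
    for corner in product((0, 1), repeat=len(axes)):
        weight = 1.0
        index = []
        for (i, t), upper in zip(located, corner):
            weight *= t if upper else 1.0 - t
            index.append(i + upper)
        value += weight * table[tuple(index)]
    return value


def adder_power(n, sigma_r, rho_a, rho_b, axes, c0_table, c1_table):
    """P_add = n * C0 + C1, as in (4.27), with C0 and C1 interpolated from the tables."""
    point = (sigma_r, rho_a, rho_b)
    return n * interpolate(axes, c0_table, point) + interpolate(axes, c1_table, point)


# Hypothetical characterisation grid and coefficient values (not measured data).
AXES = ([0.25, 0.5, 1.0], [-0.9, 0.0, 0.9], [-0.9, 0.0, 0.9])


def make_table(f):
    """Tabulate f at every grid point, indexed by grid position."""
    return {
        (i, j, k): f(s, a, b)
        for i, s in enumerate(AXES[0])
        for j, a in enumerate(AXES[1])
        for k, b in enumerate(AXES[2])
    }


C0_TABLE = make_table(lambda s, a, b: 1.0 + 0.1 * s - 0.2 * a - 0.2 * b)
C1_TABLE = make_table(lambda s, a, b: 0.5 - 0.05 * a - 0.05 * b)

print(adder_power(16, 0.75, 0.45, -0.45, AXES, C0_TABLE, C1_TABLE))
```

Because each estimate needs only a handful of table reads and multiplications, a sketch of this kind is cheap enough to call repeatedly inside a word-length search.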
The model has been tested by comparing the adder power it estimates to the power
predicted by XPower, for signals of word-lengths between 4 and 32 bits with randomly
chosen signal statistics, as shown in Figure 4.9. As can be seen in Figure 4.9, the
macro-model shows a high degree of accuracy, with a mean relative error in the estimates
made compared to XPower of 5.13%.
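For reference, a mean relative error of this kind can be computed as sketched below. The exact definition used for the figures quoted in this chapter is not restated here, so the formula (mean of |estimate − reference| / |reference|) is an assumption, and the function name and sample values are hypothetical.

```python
def mean_relative_error(estimates, references):
    """Mean of |estimate - reference| / |reference| over paired values."""
    if len(estimates) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(e - r) / abs(r) for e, r in zip(estimates, references))
    return total / len(references)


# Hypothetical macro-model estimates against reference values, in W/MHz.
model = [1.1e-5, 0.9e-5, 1.5e-5]
xpower = [1.0e-5, 1.0e-5, 1.5e-5]
print(f"{100 * mean_relative_error(model, xpower):.2f}%")
```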
4.3 Macro-model for LUT Multipliers
Multiplications implemented in LUTs are constructed as was described in Section 3.2.2,
where two partial products can be generated in the logic of a column of LUTs and are
then summed in that column’s carry chain. A tree of adders is then used to accumulate
the summed pairs of partial products to generate the product of the two inputs.
Fortunately the transition density-based activity estimation model for adders that
was described in Section 4.1 can be applied, after some adjustment, to the problem of
[Figure 4.9 plot. x-axis: adder power estimated by XPower (W/MHz); y-axis: adder power estimated by macro-model (W/MHz); both axes span 0 to 2 × 10^−5.]
Figure 4.9: The estimated power consumption in various adders of word-lengths 4 to
32 bits, estimated by XPower (x-axis) and the proposed power macro-model (y-axis).
150 tests involving random input statistics were performed. The dashed lines show
±10% error margins.
estimating the activity at the outputs of the partial product adders and the adders in the
adder tree of a multiplier constructed in this fashion. This will allow us to model how the
activity within a multiplier constructed in this way changes with signal parameters and
word-length in order to facilitate the selection of the parameters for a multiplier power
consumption macro-model.
4.3.1 Multiplier activity model
The activity model for multiplication is built by first estimating the activity at the output
of the logic that generates and sums pairs of partial products. An adder activity model
based on the one developed in Section 4.1 is then required to estimate the activity in the
adder tree.
Activity model for partial product pair sum
Figure 4.10 shows the logic used to generate one bit of the sum and a carry out from
adding a pair of partial products in a Xilinx FPGA. The carry chain used to sum the pair
of partial products is identical to that used in addition, and as such the same activity
model as for addition can be used except that the different activities at Ln and An must
[Figure 4.10 diagram. Labelled signals: Xn−1, Xn, Yn−1, Yn, An, Ln, Cn−1, Cn and Sn; labelled blocks: a LUT and a 0/1 multiplexer with select input sel.]
Figure 4.10: The logic used to generate one bit of the sum of a pair of partial products,
and a carry-out. Within a LUT, the bits Xn, Xn−1 of the multiplier signal X are
multiplied with two bits Yn, Yn−1 of the multiplicand signal Y , to create a pair of
partial products that are summed. An extra AND gate and the carry logic available
within a SLICE are required to calculate the carry-out from the sum of the pair of
partial product bits. Two bits of the sum of two partial products can be calculated
in the two LUTs and associated logic of a SLICE in a Xilinx FPGA in this way.
be accounted for.
Let us assume that the input bits in X and Y are in region 1 of the DBT activity
model, i.e. all bits have activity 0.5 and are uncorrelated with each other. As a result the
signal probabilities, activities and correlations of signals An and Ln can be calculated as
follows:
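$$P(A_n) = P(X_{n-1} Y_{j-1}) = \tfrac{1}{4} \tag{4.28}$$
$$P(L_n) = P\left(X_n Y_j \overline{X_{n-1} Y_{j-1}} + \overline{X_n Y_j} X_{n-1} Y_{j-1}\right) = \tfrac{3}{8} \tag{4.29}$$
$$D(A_n) = P(X_{n-1}) D(Y_{n-1}) + P(Y_{n-1}) D(X_{n-1}) = \tfrac{1}{2} \tag{4.30}$$
$$D(L_n) = P(X_n) D(Y_n) + P(Y_n) D(X_n) + P(X_{n-1}) D(Y_{n-1}) + P(Y_{n-1}) D(X_{n-1}) = 1 \tag{4.31}$$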
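The values in (4.28) to (4.31) can be checked by brute force. The sketch below enumerates all combinations of the four input bits, assuming they are independent and equally likely to be 0 or 1 (region 1 of the DBT model), and assuming that Ln is the exclusive-OR of the two partial product bits, which is the reading consistent with the value 3/8. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Every equally likely combination of the four input bits (X_n, Y_j, X_{n-1}, Y_{j-1}).
combos = list(product((0, 1), repeat=4))

# A_n is the AND of the second bit pair; L_n is the XOR of the two partial product bits.
p_a = Fraction(sum(xp & yp for _, _, xp, yp in combos), len(combos))
p_l = Fraction(sum((x & y) ^ (xp & yp) for x, y, xp, yp in combos), len(combos))

# Region 1 bits: signal probability 1/2 and transition density 1/2 for every input bit.
p_bit = Fraction(1, 2)
d_bit = Fraction(1, 2)
d_a = p_bit * d_bit + p_bit * d_bit        # as in (4.30)
d_l = 2 * (p_bit * d_bit + p_bit * d_bit)  # as in (4.31)

print(p_a, p_l, d_a, d_l)  # 1/4 3/8 1/2 1
```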
The sum Sn and carry-out Cn densities are calculated as follows:
$$D(C_n) = P(\overline{L_n})\,D(A_n) + P(L_n)\,D(C_{n-1}) + P(A_n \oplus C_{n-1})\,D(L_n) \tag{4.32}$$
$$D(S_n) = D(C_{n-1}) + D(L_n) \tag{4.33}$$
Under the assumption that both input signals are in region 1 of activity of the DBT
model and that An is not correlated to Cn−1, D(Cn) and D(Sn) become:
$$D(C_n) = \frac{5}{8} \times \frac{1}{2} + \frac{3}{8} \times D(C_{n-1}) + \frac{1}{2} = \frac{13}{16} + \frac{3}{8} \times D(C_{n-1}) \tag{4.34}$$
$$D(S_n) = D(C_{n-1}) + 1 = 1 + D(C_{n-1}) \tag{4.35}$$
Let us assume that the multiplier inputs X and Y are registered and that their
transitions arrive at the multiplier near-simultaneously, in which case the inertial delay
coefficients α and β from Section 4.1.3 must once again be used to calculate D(Sn).
$$D(S_n) = \alpha + \beta D(C_{n-1}) \tag{4.36}$$
As in Section 4.1.2 a closed form equation can be calculated from the recursive
equation for D(Cn) to allow the activity at the output of the logic that sums pairs of
partial products to be estimated quickly. These activities are then used to estimate the
activity throughout the adder tree that sums all partial products to form the result of
the multiplication. The following subsection describes how the activities in the adder tree
itself are estimated.
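As an illustration of how a closed form follows from the recursion, the sketch below solves (4.34) as a first-order linear recurrence and checks the result against direct iteration. It is a sketch under stated assumptions: the inertial delay coefficients α and β are ignored, the carry-in density of the first bit is taken to be zero, and the function names are hypothetical. The closed form derived in Section 4.1.2 may be expressed differently.

```python
from fractions import Fraction

A = Fraction(13, 16)  # constant term of (4.34)
R = Fraction(3, 8)    # coefficient of D(C_{n-1}) in (4.34)
LIMIT = A / (1 - R)   # fixed point of the recursion, equal to 13/10


def carry_density_recursive(n, d0=Fraction(0)):
    """Apply (4.34) n times, starting from carry-in density d0."""
    d = d0
    for _ in range(n):
        d = A + R * d
    return d


def carry_density_closed(n, d0=Fraction(0)):
    """Closed form of the same recurrence: a geometric approach to LIMIT."""
    return LIMIT + (d0 - LIMIT) * R ** n


for n in (1, 2, 4, 8):
    print(n, float(carry_density_closed(n)))
```

Under these assumptions the carry density converges geometrically to 13/10 transitions per clock cycle, so by (4.35) the sum density approaches 23/10 within a few bit positions.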
Activity model for adder tree
The adders in the adder tree of the multiplier are constructed in the same way as those
for which an activity model was developed in Section 4.1. However we cannot assume that
the inputs to the adders in the adder tree are glitch-free, as glitches are propagated to
these adders from the partial product generators, and from previous adders in the adder
tree. Let us re-visit the adder transition density model from Section 4.1.1, re-stated in
(4.37) and (4.38), where Cn and Sn are the carry-out and sum outputs of full adder n in a
carry chain, respectively, in terms of the adder inputs An and Bn and the carry-in signal
Cn−1.
$$D(C_n) = P(A_n \oplus C_{n-1})\,[D(A_n) + D(B_n)] + P\left(\overline{A_n \oplus B_n}\right) D(A_n) + P(A_n \oplus B_n)\,D(C_{n-1}) \tag{4.37}$$
$$D(S_n) = D(A_n) + D(B_n) + D(C_{n-1}) \tag{4.38}$$
The transition densities of the inputs to the adders in the adder tree are known
as they are propagated from the outputs of previous components in the multiplier, how-
ever some assumptions must be made about the signal probabilities P (An ⊕ Cn−1) and
P (An ⊕Bn) to allow the transition densities of the outputs of the adders in the adder
tree to be calculated.
Assuming that An, Bn and Cn−1 are signals that contain a number of glitches that
occur at random intervals through the course of a clock cycle, we can deduce that the three
signals at the input to a full-adder are uncorrelated with each other, on average, over the
course of a clock cycle. As a result the probabilities P (An ⊕ Cn−1) and P (An ⊕Bn) are
equal to 0.5. Hence the adder transition density equations can be evaluated for every
full-adder in the adder tree of a multiplier by propagating activities through the tree
appropriately.
As the input transitions arrive at different times to the inputs of the adders in the
adder tree, the inertial delay coefficients α and β from Section 4.1.3 are not required when
calculating transition densities within the adder tree.
Equations (4.37), (4.38) and the signal probability values above completely define
the activity model for the adder tree of a multiplier. In order to accurately calculate the
output activity of a multiplier however, the alignment of the addition of partial products
must also be accounted for when estimating activities in the adder.
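The propagation of activity through one adder of the tree can be sketched as below. This is an illustrative reading of (4.37) and (4.38) with both signal probabilities set to 0.5, taking the second term of (4.37) to be weighted by the probability that An ⊕ Bn is false. The function name and the example densities are hypothetical, and the alignment and sign extension of partial products are not modelled.

```python
def propagate_adder(d_a, d_b, d_carry_in=0.0, p_xor_ac=0.5, p_xor_ab=0.5):
    """Return (per-bit sum densities, final carry-out density) for one ripple-carry adder.

    d_a and d_b list the transition densities of the two input words, LSB first.
    """
    d_sum = []
    d_c = d_carry_in
    for da, db in zip(d_a, d_b):
        d_sum.append(da + db + d_c)           # (4.38), using the carry-in density
        d_c = (p_xor_ac * (da + db)           # (4.37): change of the select signal
               + (1.0 - p_xor_ab) * da        # (4.37): A passed to the carry-out
               + p_xor_ab * d_c)              # (4.37): carry-in passed to the carry-out
    return d_sum, d_c


# Hypothetical 4-bit inputs with 0.5 transitions per clock cycle on every bit.
level_1, carry_1 = propagate_adder([0.5] * 4, [0.5] * 4)
print(level_1, carry_1)  # [1.0, 1.75, 2.125, 2.3125] 1.40625

# Feeding two such outputs into the next level shows activity accumulating with depth.
level_2, _ = propagate_adder(level_1, level_1)
print(level_2)
```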
Figure 4.11: The size, sign extension, and alignment of the partial products summed
in the adder tree of an 8 × 8-bit multiplier. The left-most diagram shows the eight partial
products generated, where each black dot represents one bit in a partial product, and
partial products are arranged in rows with their MSBs on the left hand side. Adjacent
pairs of partial products that are not separated by a line are added together, where
in each case the top-most partial product needs to be sign extended (shown by a grey
dot). The middle diagram depicts the partial product sums that result from the first
figure that must also be sign-extended, aligned and added. Similarly the right-most
diagram shows the final addition that must be performed to calculate the product of
the inputs.
Figure 4.11 depicts the alignment of partial products at the inputs and internally
for the adders that form the partial product summation tree. Notice the sign extension
and partial product alignment that is necessary to correctly compute the product of the
inputs. By accounting for these details and the fact that we need only add partial product
bits that ‘overlap’ we can apply the adder tree activity model in (4.37) and (4.38) to each
adder in the adder tree, starting from those adders in the top of the tree and propagating
the activity at their outputs to the next level in the tree, and so on.
Results
The output activities estimated for multipliers using this method have been compared
to those calculated by XPower for a variety of multipliers with uncorrelated inputs and
different input word-lengths. A high level of correlation between the two is observed,
though some minor differences between the two occur due to the inertial delay of compo-
nents not being accounted for in the transition density based model, and due to the signal
probability assumptions made for P (An ⊕ Cn−1).
Figure 4.12 shows the activities estimated by both methods for multipliers with
several different word-lengths. It can be seen that the output activity of the multiplier
rises quickly between the LSB and the central bits in the output word, as an increasing
number of partial products begin to ‘overlap’. Activity peaks in the central bits of the
[Figure 4.12 plots. Four panels, each with x-axis: output bit number and y-axis: output activity (transitions / clock cycle). (a) The output activity of an 8x8-bit multiplier (bit 16 is the MSB); (b) the output activity of a 16x8-bit multiplier (bit 24 is the MSB); (c) the output activity of a 16x16-bit multiplier (bit 32 is the MSB); (d) the output activity of a 32x16-bit multiplier (bit 48 is the MSB).]
Figure 4.12: Curves showing the activity estimated using the transition density based
adder model (dashed lines) and switch-level simulation using ModelSim and XPower
(solid lines) for the output of multipliers of various word-lengths, driven by un-
correlated inputs. The input word-lengths used are as follows, stated in the form
multiplier ×multiplicand: (a) 8× 8, (b) 16× 8, (c) 16× 16, and, (d) 32× 16
output as all partial products ‘overlap’ at this point, and as a result glitches accumulate
through the sum output of all adders in the adder tree. Activity starts to decrease slowly
between the central bits in the output word and the MSB. The slow decrease is mainly
due to extra transitions propagated through the carry chains of each adder in the adder
tree towards the upper bits in each adder.
As the model developed is clearly able to account for the internal activity within a
multiplier implemented in LUTs, it will be used in subsequent sections to relate the power
consumption in multipliers to their input word-lengths and input signal statistics.
4.3.2 Relating power consumption to input word-length
The activity estimates and measurements that were shown in Figure 4.12 exhibit several
trends that, in conjunction with the models themselves, allow us to relate the word-lengths
of the inputs to a multiplier to the activities within it.
The word-length of each of the two inputs to a multiplier has a distinct effect on
the multiplier’s construction and as a result the power consumed within the multiplier is
affected by these input word-lengths in different ways.
Let us first consider the effects of changing the word-length of the multiplier (i.e. not
the multiplicand) input signal. The multiplier is multiplied by each bit of the multiplicand
to form partial products that are then added. Hence increasing the word-length of the
multiplier signal increases the word-length of the partial products that are to be summed.
From the activity models developed, we know that increasing the word-length of an adder
causes a proportional increase in the activity (and hence power consumption) within the
adder. This is because activity does not accumulate in the carry chain of an adder,
but instead approaches an upper limit. Increasing the word-length of the adder merely
increases the number of full-adders whose activity is at this upper limit.
In the case of the adder tree of a multiplier, a similar phenomenon occurs. Increas-
ing the word-length of the multiplier signal increases the number of partial product bits
that ‘overlap’ when they are summed, due to their alignment. This can be visualised by
considering the diagram in Figure 4.11 and appending extra partial product bits onto the
right hand side of each partial product row.
As a result we can expect an increase in the number of output bits that are part of
the ‘peak’ of activity in that word. This can be seen if we compare the activity profile of
an 8× 8-bit multiplier, shown in Figure 4.12(a), to that of a 16× 8-bit multiplier, shown
in Figure 4.12(b). We can see that the peak level of activity is roughly the same in the
two multipliers, but there is a wider peak in the multiplier that has a greater word-length
for its multiplier signal. We see the same phenomenon when comparing a 16 × 16-bit
multiplier, shown in Figure 4.12(c), to a 32× 16-bit multiplier, shown in Figure 4.12(d).
In contrast the word-length of the multiplicand signal of a multiplier has a sig-
nificantly different effect on the activity within the multiplier. One partial product is
generated by multiplying each multiplicand bit with the multiplier, so increasing the num-
ber of multiplicand bits increases the number of partial products and hence the number of
additions needed to sum the partial products. Crucially however, the depth of the adder
tree is increased as multiplicand word-length increases, and as can be seen from (4.38)
activity is accumulated at the inputs to a full-adder and propagated through the sum
output. This can be seen when comparing the activity at the output of a 16 × 8-bit multiplier,
in Figure 4.12(b), to the activity at the output of a 16 × 16-bit multiplier, in Figure 4.12(c).
Including an extra partial product to be summed in an adder tree requires an extra
adder to sum all partial products. However, the total activity in the adder tree increases
quadratically since as well as the activity within the new adder, the activity in adders in
deeper levels of the tree than the new adder is also increased, due to activity propagated
from the new adder. Put another way, the more levels there are within the adder tree, the
more adders in the tree have their activity increased by the addition of an extra partial
product, hence the quadratic relationship described.
To form a power consumption model for multipliers implemented in LUTs on an
FPGA, both the activity and capacitance within multipliers must be accounted for. As
with adders, capacitance of the logic elements within multipliers can be assumed to be
uniform across the LUTs that form the multiplier, and hence power consumption in the
LUTs of multipliers can be completely determined if the activity in the multiplier can be
reliably predicted and multiplied by appropriate coefficients to account for capacitance.
The power consumed in the configurable routing wires that connect the LUTs of a
multiplier together is likely to show significant variation however, due to the unknown
capacitance of these wires (at the optimisation stage). Variations in capacitance can arise
as a result of the following factors:
1. variations in the placement of the multiplier LUTs that result in different routing
requirements,
2. variations in the routes themselves due to the use of simulated annealing-based
techniques for routing.
Of these two causes of variation in capacitance, the first can be circumvented by
using pre-defined placements of the LUTs of a multiplier that have typically been designed
and optimised by hand. Such pre-defined multiplier placements are used by the Xilinx
Core Generator for multipliers. Unfortunately variations in the actual routes used to
connect LUTs in a multiplier cannot be avoided unless pre-defined routing is also used.
There will as a result be some variation in the capacitance of the intra-routing wires of
multipliers, though as these are formed of columns of closely packed and tightly interlinked
LUTs it is expected that the capacitance of each of these wires will be small and show
only low levels of variation.
As a result, models for the power consumption in multipliers can be constructed in
a similar way to those for addition, assuming that capacitances within the multiplier are
constant. To accurately estimate the power in a multiplier, word-level statistical param-
eters that predict the activity within the multiplier should be used and scaled according
to multiplier word-length. The activity models described in this section have determined
that multiplier activity is linearly related to the word-length of the multiplier signal, and
quadratically related to the multiplicand signal, suggesting a power macro model of the
form:
$$P_{\mathrm{mult}} = m^2 C_3 + n C_2 + m^2 n C_1 + C_0 \tag{4.39}$$
where C3, C2, C1 and C0 are coefficients that account for the activity and capacitance
within the multiplier, m is the word-length of the multiplicand input signal, and n is
the word-length of the multiplier input signal.
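The form of (4.39) can be explored numerically with a short sketch such as the one below. The coefficient values are placeholders chosen only to make the scaling visible, not characterised values, and the function name is hypothetical.

```python
def multiplier_power(n, m, c3, c2, c1, c0):
    """Evaluate P_mult = m^2*C3 + n*C2 + m^2*n*C1 + C0, as in (4.39).

    n is the multiplier word-length and m is the multiplicand word-length.
    """
    return m ** 2 * c3 + n * c2 + m ** 2 * n * c1 + c0


# Placeholder coefficients (arbitrary units), keeping only the m^2 * n term.
coeffs = dict(c3=0.0, c2=0.0, c1=1.0, c0=0.0)

base = multiplier_power(8, 16, **coeffs)
print(multiplier_power(16, 16, **coeffs) / base)  # doubling n: factor of 2.0
print(multiplier_power(8, 32, **coeffs) / base)   # doubling m: factor of 4.0
```

With the other coefficients non-zero the ratios are no longer exactly 2 and 4, but the linear dependence on n and the quadratic dependence on m remain.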
4.3.3 Statistical parameters
Multipliers, like adders, have two inputs, so the same seven statistical parameters define
their bit-level activity and correlation. However, there are several statistical parameters
that we can omit for the reasons discussed previously in Section 4.2.2.
Firstly, the consistent scaling of signals achieved due to range analysis allows the
standard deviation of the input signals to a multiplier to be ignored during power con-
sumption estimation, as the inputs to a multiplier are not aligned to each other in relation
to their scaling. Secondly, we can identify that, as in adders, the lag-1 cross correlation
parameters ρxy1 and ρyx1 have no effect on the power consumption of a multiplier because,
as shown in the transition density model for multipliers, activity within the multiplier is
due to activity in the X and Y inputs and correlation between the current values of these
inputs.
This leaves the cross-correlation of the inputs and lag-1 autocorrelation of each
input. As shown in Section 4.1.2 the word-level cross-correlation of two signals affects the
bit-level correlation between the bits of those two signals that are in regions 2 and 3 of
the DBT model.
However, we know there are only a small number of region 2 and 3 bits in each
signal, due to the input signals to the multiplier being ‘well scaled’. There are only 2 bits in
activity region 3 of each signal, where the bit-level effects of word-level cross-correlation
are most strongly felt, and thus a total of 2 × 2 partial product bits are formed from
these region 3 bits. However a total of n × m partial product bits (multiplier word-length
× multiplicand word-length) are summed in the multiplier, so even if word-level
cross-correlation affects the activity in the full-adders that add these 2 × 2 partial
products, the activity in the remainder of the multiplier remains unchanged. As a result
we would expect the power consumption in the remainder of the multiplier to overshadow
any variations that are due to word-level cross-correlation between the inputs of the
multiplier.
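To give a sense of scale, the fraction of partial product bits formed purely from region 3 bits can be computed directly. The sketch below only restates the 2 × 2 out of n × m argument above; the function name is hypothetical.

```python
def region3_fraction(n, m, region3_bits=2):
    """Fraction of the n * m partial product bits formed from region 3 bits of both inputs."""
    return (region3_bits * region3_bits) / (n * m)


for n, m in [(8, 8), (16, 16), (32, 16)]:
    print(f"{n}x{m}: {100 * region3_fraction(n, m):.2f}% of partial product bits")
```

For a 16 × 16-bit multiplier this is 4 of 256 partial product bits, or roughly 1.6%.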
On the other hand the word-level lag-1 autocorrelation of a signal affects the activity
of the bits in regions 2 and 3 of the signal. Each bit in the multiplier input is delivered
to each one of the m/2 columns of logic that sum pairs of partial products (where m
is the multiplicand word-length), and each bit of the multiplicand input is delivered to
every one of the n LUTs in the column of logic that generates that multiplicand bit’s
corresponding partial product (where n is the multiplier word-length). As the word-level
lag-1 autocorrelation of a signal directly affects its activity, we can expect the word-level
[Figure 4.13 plot. x-axis: cross-correlation coefficient ρxy (−1 to 1); y-axis: power consumption (mW/MHz); curves for ρy = −0.9, −0.5, 0.0, 0.5 and 0.9.]
Figure 4.13: Variation in the dynamic power consumption in a 16 × 16-bit multiplier
while the word-level cross-correlation ρxy and auto-correlation in input y, (i.e. ρy), are
varied. The other signal statistics are held constant: σx = 1, σy = 1, ρx = 0.5.
lag-1 autocorrelation of the multiplier and multiplicand inputs to have a significant effect
on the activity within a multiplier, and hence its power consumption.
Figure 4.13 shows the range in power consumption that occurs due to variations in
the word-level cross-correlation and lag-1 autocorrelation of the input signals to a 16×16-
bit multiplier. The data shown provides experimental evidence that, as stated above,
signal auto-correlation has a significant effect on multiplier power consumption, whilst
cross-correlation between inputs does not. We can take advantage of these observations
when building simple macro-models to estimate multiplier power consumption.
4.3.4 Summary
By using a transition density based model to estimate the activity within LUT-based mul-
tipliers it has been possible to evaluate the effects of input word-length and input signal
activity on the power consumption in multipliers. The model has allowed us to establish
that the word-length of the multiplier input signal is linearly related to power consump-
tion, whilst the word-length of the multiplicand input is quadratically related to power
consumption. We have also established that, of the seven word-level statistical parameters
that describe the bit-level activities and correlations of the input signals, only the lag-1
[Figure 4.14 plot. x-axis: multiplier power estimated by XPower (W/MHz); y-axis: multiplier power estimated by macro-model (W/MHz); both axes span 0 to 8 × 10^−3.]
Figure 4.14: The estimated power consumption in various multipliers of word-lengths
4 to 32 bits, estimated by XPower (x-axis) and the proposed power macro-model (y-
axis). 100 tests involving random input statistics were performed. The dashed lines
show ±10% error margins.
autocorrelations of the signals are necessary to accurately estimate power consumption.
The resulting macro-model for the power consumed in a multiplier is stated in
(4.40), where n and m are the word-lengths of the multiplier and multiplicand inputs,
respectively, and C3, C2, C1 and C0 are characterisation coefficients that account for the
fixed capacitance and variable activity of multipliers. The activity depends on the lag-1
autocorrelations ρx, ρy of the multiplier and multiplicand inputs respectively. As was the
case for the adder, the multiplier characterisation coefficients are determined from a set
of training data.
$$P_{\mathrm{mult}} = m^2 C_3(\rho_x, \rho_y) + n C_2(\rho_x, \rho_y) + m^2 n C_1(\rho_x, \rho_y) + C_0(\rho_x, \rho_y) \tag{4.40}$$
Figure 4.14 shows the power consumption estimated by both XPower and the pro-
posed macro-model for multipliers implemented in LUTs of various word-lengths, each
with randomly chosen values for their input signal statistics. It can be seen that the
model is particularly accurate for multipliers with short word-lengths, but there is a mod-
erate amount of error in the estimates made for larger multipliers. Overall the mean
relative error in the estimates made compared to XPower is 11.9%.
[Figure 4.15 plot. x-axis: multiplier word-length (bits); y-axis: multiplier power consumption (W/MHz); curves for multiplicand word-lengths of 26, 18, 10 and 4 bits.]
Figure 4.15: The variation in power consumption of a multiplier implemented in
LUTs (y-axis) as the multiplier input word-length is changed (x-axis) for four different
multiplicand word-length values: 4, 10, 18 and 26.
The lower accuracy of these models is mainly due to the unpredictable nature of
the place and route algorithms used. As each multiplier of different word-length is routed
differently, and routes that are non-timing critical are not optimised by the router, the
capacitance of intra-routing nets in different multipliers varies significantly. The larger a
multiplier, the more nets there are that have an unpredictable capacitance, and the longer
these nets may be (and hence the greater their capacitance and contribution to power
consumption).
Figure 4.15 highlights the unpredictable nature of place and route by showing the
variation in power consumption for a number of multipliers with the same input signal
statistics. We can see that in general the data plotted agrees with the power consumption
macro-model derived from the multiplier transition density model, i.e. power consumption
is proportional to multiplier input word-length and quadratically related to multiplicand
word-length. However there are unexpected deviations from this due to the variation in
routing capacitance within the different multipliers.
4.4 Macro-model for Embedded Multipliers
Modern FPGAs contain fully custom multiplier implementations in the form of what
are called embedded multiplier components. As shown in Section 3.2.2 each embedded
multiplier in the Xilinx Virtex 2 Pro device implements an 18× 18-bit two’s complement
multiplication in a much smaller area than an equivalent multiplier implemented in LUTs.
Given that embedded multipliers do not contain any configurable routing wires
or configurable logic, we can expect them to also be much more power efficient than
multipliers implemented in LUTs. For multiplications that have at least one input
word-length greater than 18 bits, embedded multipliers become less efficient, as a
combination of several embedded multipliers and extra addition logic is required.
Unfortunately FPGA manufacturers have not released any details of the internal
construction of the embedded multipliers used, preventing us from building a model of how
activity propagates through these components. However we can expect that the activity
in the partial product generators of these embedded multipliers is sensitive to the same
word-level statistics as LUT multipliers, i.e. the lag-1 autocorrelation of each input signal,
and hence we can characterise a macro-model using these.
The relationship between the input word-length of embedded multipliers and their
power consumption is less clear however. Measurements from XPower indicate that the
power consumption of a single embedded multiplier is low, at 15µW/MHz (compared to
an 18 × 18-bit LUT multiplier at 700µW/MHz). When extra embedded multipliers are
used, however, step changes in power consumption occur, as shown in Figure 4.16.
As no models are available to relate input word-length to embedded multiplier
power consumption, which is particularly important for multiplications larger than
18 × 18 bits, fast estimation of embedded multiplier power consumption is best performed
via interpolation over a surface such as that shown in Figure 4.16.
In order to construct a simple macro-model for embedded multipliers, measurements
of the multiplier’s power consumption were taken for a range of combinations of
values of ρx and ρy (i.e. the lag-1 autocorrelations of each input signal). For power con-
[Figure 4.16 surface plot. Horizontal axes: multiplier word-length and multiplicand word-length (0 to 40 bits); vertical axis: multiplier power consumption (W/MHz).]
Figure 4.16: The power consumption of embedded multipliers of input word-lengths
between 4 and 36 bits. Inputs with the same signal statistics are used throughout,
i.e. ‘well scaled’ uncorrelated Gaussian inputs.
sumption estimation during word-length optimisation, some multiplier inputs may arise
with values of ρx and ρy in between those that have been measured a priori for the
surfaces described above. In such a situation, a new surface can be created prior to the
start of word-length optimisation by interpolating between the values of the surface that
correspond to the embedded multiplier power consumptions at the closest measured val-
ues of ρx and ρy. These new surfaces can then be used to quickly estimate the power
consumption of embedded multipliers with particular word-level input statistics during
word-length optimisation.
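The construction of such a new surface can be sketched as follows. This is a minimal illustration that assumes bilinear weighting of the four measured surfaces whose (ρx, ρy) values surround the requested pair. The data layout, the function names and the `fake_power` stand-in for measured data are all hypothetical, and only a few word-lengths are tabulated to keep the example short.

```python
from bisect import bisect_right

RHO_AXIS = [-0.99, -0.5, 0.0, 0.5, 0.99]  # autocorrelation values at which surfaces exist


def bracket(axis, x):
    """Locate x on a sorted axis: (lower grid index, fraction towards the next point)."""
    x = min(max(x, axis[0]), axis[-1])  # clamp to the measured range
    i = min(bisect_right(axis, x) - 1, len(axis) - 2)
    return i, (x - axis[i]) / (axis[i + 1] - axis[i])


def blend_surfaces(surfaces, rho_x, rho_y, axis=RHO_AXIS):
    """Build a surface {(n, m): power} for (rho_x, rho_y) from the four nearest surfaces.

    surfaces maps each measured (rho_x, rho_y) pair to a dict {(n, m): power}.
    """
    i, tx = bracket(axis, rho_x)
    j, ty = bracket(axis, rho_y)
    corners = [
        (surfaces[(axis[i], axis[j])], (1 - tx) * (1 - ty)),
        (surfaces[(axis[i + 1], axis[j])], tx * (1 - ty)),
        (surfaces[(axis[i], axis[j + 1])], (1 - tx) * ty),
        (surfaces[(axis[i + 1], axis[j + 1])], tx * ty),
    ]
    return {key: sum(w * s[key] for s, w in corners) for key in corners[0][0]}


def fake_power(n, m, rho_x, rho_y):
    """Hypothetical stand-in for a measured power value in W/MHz (not real data)."""
    return 1e-6 * (n + m) * (1.0 - 0.3 * rho_x - 0.3 * rho_y)


WORD_LENGTHS = (4, 18, 36)  # a few example word-lengths only
SURFACES = {
    (rx, ry): {(n, m): fake_power(n, m, rx, ry) for n in WORD_LENGTHS for m in WORD_LENGTHS}
    for rx in RHO_AXIS
    for ry in RHO_AXIS
}

surface = blend_surfaces(SURFACES, 0.25, -0.75)
print(surface[(18, 18)])
```

Once a blended surface has been built for each multiplication in the design, every estimate made during optimisation reduces to a lookup on that surface, followed by interpolation between word-lengths where needed (not shown here).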
Although the proposed method is very fast, as each power consumption estimate
requires only a table-lookup followed by interpolation between the nearest data points
available for the given word-length, the amount of data required to store these models in
memory is high. To ensure that the models can be characterised in a reasonable amount of
time and stored in memory during optimisation it is necessary to limit the number of data
points that are recorded for the model. For input signals 2000 samples long, approximately
3 data points can be measured in one minute, hence we have chosen to measure the power
consumption of embedded multipliers of input word-lengths of multiples of two between
4 and 36 bits, and with input signal lag-1 autocorrelation values ρx and ρy of -0.99, -0.5,
0, 0.5 and 0.99. This gives a total of 16 × 16 × 5 × 5 = 6400 points in the macro-model
[Figure 4.17 plot. x-axis: multiplier power estimated by XPower (W/MHz); y-axis: multiplier power estimated by macro-model (W/MHz); both axes span 0 to 8 × 10^−5.]
Figure 4.17: The estimated power consumption in embedded multipliers of input
word-lengths from 4 to 32 bits, estimated by XPower (x-axis) and the proposed power
macro-model (y-axis). 100 tests involving random input statistics were performed.
The dashed lines show ±10% error margins.
which take a total of 35.6 hours to gather during characterisation using XPower.
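The characterisation time quoted above follows directly from the size of the grid and the measurement rate. The short sketch below reproduces the arithmetic using the factors as stated in the text; the variable names are illustrative.

```python
word_length_points = 16      # word-length values per input, as stated in the text
autocorrelation_points = 5   # -0.99, -0.5, 0, 0.5 and 0.99
points_per_minute = 3        # approximate rate for 2000-sample input signals

points = word_length_points ** 2 * autocorrelation_points ** 2
hours = points / points_per_minute / 60

print(points, round(hours, 1))  # 6400 35.6
```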
The large number of data points gathered ensures an accurate model for a large
range of input word-lengths and signal statistics. Tests shown in Figure 4.17, comparing the
power estimated by the proposed method to that calculated using XPower for embedded
multipliers of input word-lengths between 4 and 36 bits with randomly selected input
signal statistics, show a mean relative error of 3.1%.
4.5 Components with input signals containing glitches
The work in this Chapter has focussed on estimating the power consumption of com-
ponents whose inputs are glitch-free by using the parameters given by the DBT model.
Unfortunately the estimation of the power consumed in components whose inputs contain
glitches has not been considered in detail during the course of the work conducted for this
thesis; however, this section aims to give a brief summary of two potential approaches for
doing so.
Whilst transitions that occur in signals that do not contain glitches can be estimated
from the data travelling on the signal by using the DBT model, transitions due to data in
signals containing glitches can form only a very small fraction of the number of transitions
that occur per clock cycle. Hence a different approach is required to estimate the power
consumed in signals containing glitches.
Two fast methods for estimating power under these circumstances can be envisioned:
i) building a simple macro-model that uses the total number of glitches across a
component’s input signals to predict both the component’s power consumption and the
total number of glitches in its output (in order to allow the power consumed in further
un-registered components connected to this one to be estimated), and ii) using transition
density to estimate the activity in combinational circuits made up of several components.
The first solution described above is attractive due to its simplicity and the single
variable necessary to describe the activity in each input signal to a component. However
only using the total activity in a signal to predict the power consumption and output
activity rates of a component is a significant simplification of the bit-level activities in
signals containing glitches. Figures 4.4 and 4.12 show that the bit-level activity profile
of the output signals of adders and multipliers is complex with different bits showing
large differences in the number of transitions per clock cycle they contain, especially for
multipliers.
If some LSBs of a signal that contains glitches are truncated, it is not clear how
the estimate of the total activity in the truncated signal should be reduced to account for
this if only total activity is known. Because bit-level activities vary widely across these
signals and no information on them is retained, any such approximation is likely to be
inaccurate. As components whose inputs contain glitches are likely to be the most
power-hungry in a system, it is important that accurate estimates of their power
consumption are made so that they can be correctly targeted during optimisation. This
puts the first method at a significant disadvantage.
Using transition density to estimate the activities within combinational circuits
formed of several components could potentially be much more accurate, as the activities of
each bit in the signals of such circuits will be known. However the models for estimating
the activities in adders and multipliers implemented in LUTs that have input signals
containing glitches that were described in this chapter have a computational complexity of
$O(n)$ and $O(n^2)$ respectively, where $n$ represents the input word-length of the components.
Hence this method would have a higher level of computational complexity compared to the
macro-models proposed in this section whose complexity is independent of word-length.
Additionally it would not be possible to use this method for multiplications implemented
using embedded blocks as no details of their internal construction are available.
Both of the above methods show promise but each has significant disadvantages.
Providing a suitable method for quickly estimating the power consumption of components
whose inputs contain glitches is thus an interesting and unsolved problem that is left for
future work.
4.6 Conclusion
This chapter has presented a set of models that demonstrate that the activity within
adders and multipliers can be predicted accurately using knowledge of the activities and
correlations of the input bits to these components. It has been shown that by using the
DBT activity model it is then possible to generalise the bit-level activities of component
input signals into regions of different activity and as a result the same word-level statistics
that predict the activities of bits in a signal in the DBT model can be applied to predict
the activity within adders and multipliers. Under the assumption that the capacitance
of the logic elements of these components is uniform across a chip we have shown that
their power consumption can then be estimated by using the DBT activity estimation
parameters and closed form functions of word-length that capture the changing trends in
activity as the size of the components change. The resulting methods use fewer parameters
than existing work whilst achieving similar levels of accuracy.
We will conclude this chapter by applying the models developed to estimate the
power consumed within all of the components of an example DSP system. The system
targeted is an infinite impulse response (IIR) filter organised as three sequential second order
sections. A block diagram of the system is given in Appendix A. Figure 4.18 compares
the power consumption estimated by the models proposed in this chapter to the values
obtained from XPower. The clock speed of this system was set arbitrarily at 1 MHz.
Embedded multipliers have been used wherever multiplication is required.
The results shown indicate a high level of correlation between the estimates made
by the models described in this chapter and XPower. It is important to note in particular
that the models developed are clearly able to distinguish the components that have higher
power consumption from those with medium to low levels of power consumption. There
is some error in the estimates made, though as we have seen in the preceding sections the
models developed are accurate to within 10% on average.
The models developed are significantly faster than XPower, with each model merely
requiring the calculation of at most a second order polynomial in terms of word-length, or
a table lookup (in the case of embedded multipliers). Thus the internal power consumption
of the components in a system can be calculated in a fraction of a second using the models
described. The time required to run a relatively short 6000 sample low-level simulation
on the system used here in order to gather signal activity rates was 5 minutes, whilst
synthesis, placement and routing of the system also took around 5 minutes. This makes
the proposed power estimation models at least several thousand times faster than gathering
power consumption information via switch-level simulation and XPower.
In the following chapter we will deal with the problem of estimating the remaining
dynamic power consumed by a circuit: that consumed in the inter-routing wires that
connect components in a circuit. Afterwards Chapter 6 will describe our method for word-
length optimisation for power consumption minimisation and present power consumption
improvements for a range of test systems. The results obtained in Chapter 6 will also
allow us to examine the success of the trade-off between model speed and accuracy that
we have conducted in this chapter.
[Figure 4.18 plot: measured internal power and estimated internal power for each component; horizontal axis, component’s internal power consumption (W/MHz), 0 to 2 × 10^-5; components listed: BiQuad1, BiQuad2 and BiQuad3 (each with AddSub0, AddSub1, AddSubB0, AddSubB1, DelayA1, DelayB1, DelayB2, FF Delay, MultA1 and MultA2), In Delay, Out Delay, and Scale Mult1 to Scale Mult3.]
Figure 4.18: The power consumption of the components in the system SOS3 LP.
Chapter 5
Models for power consumption in
configurable routing
This chapter describes a method for estimating the power consumed in the routing wires
that connect arithmetic components in an implementation of a DSP circuit in an FPGA.
These inter-routing wires form a significant part of the dynamic power consumed in FPGAs,
and as the choice of word-lengths for the signals in a system has an effect on both
the number of these wires and the size of the components these wires connect to (and
hence component placement, net length and net capacitance) it is critical that the power
consumed in these wires is accounted for during word-length optimisation for power
consumption minimisation.
The modelling of the power consumed in inter-routing wires is significantly different
to the modelling of the power consumed in arithmetic components which was discussed
in Chapter 4. Whilst arithmetic components consist of combinational logic with multiple
inputs that can give rise to glitches within the component, routing wires have one input
and contain no combinational logic, hence the activity in a routing wire is determined
solely by that of its input.
Arithmetic components implemented in FPGAs have a regular structure, due to
the use of the dedicated carry-chains available in the FPGA fabric. In contrast, a routing
wire may use any one of a multitude of routes to connect points in a circuit. Because
of the high flexibility of FPGA routing and the use of stochastic optimisation techniques
such as simulated annealing to place logic and route wires, it is very difficult to predict
the layout of the routing wires in a circuit before routing is performed.
As was explained in Section 2.3.2, (5.1) can be used to estimate the power consumed
in a routing wire that has capacitance C and is being switched at an average number of α
times per clock cycle of frequency f where the voltage swing is V.
\[ P = \frac{\alpha}{2} \cdot C \cdot f \cdot V^2 \tag{5.1} \]
To estimate the total dynamic power consumed in the inter-routing wires of a circuit
using (5.1) the capacitance and average switching activity of each signal in the circuit must
be known. Whilst activity values for a routing wire can be determined from the word-level
statistics of the signals passing through it using techniques such as the DBT method (also
summarised in Section 2.3.2), the capacitance of a wire is difficult to determine at an early
stage because it is dependent on the wire’s layout, the number of LUTs it connects to and
other factors that are described in this chapter.
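As an illustration of how (5.1) is applied once per-wire activity and capacitance values are available, the following minimal Python sketch evaluates the power of a single wire and sums it over a set of wires. The function names and all numerical values are hypothetical and are not taken from the thesis.

```python
def wire_dynamic_power(alpha: float, capacitance: float, freq: float, v_swing: float) -> float:
    """Dynamic power in watts of one routing wire, following (5.1)."""
    return 0.5 * alpha * capacitance * freq * v_swing ** 2


def total_routing_power(wires, freq: float, v_swing: float) -> float:
    """Sum (5.1) over an iterable of (alpha, capacitance) pairs, one pair per wire."""
    return sum(wire_dynamic_power(a, c, freq, v_swing) for a, c in wires)


# Hypothetical wire: 0.5 transitions per cycle, 2 pF, 1 MHz clock, 1.5 V swing.
single = wire_dynamic_power(alpha=0.5, capacitance=2e-12, freq=1e6, v_swing=1.5)
# Two hypothetical wires sharing the same clock and voltage swing.
total = total_routing_power([(0.5, 2e-12), (0.25, 4e-12)], freq=1e6, v_swing=1.5)
```

Because the activity and capacitance of each wire enter only as a product, the two quantities can be estimated independently and combined at the end, which is the structure the rest of this chapter relies on.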
Unfortunately, as discussed in Section 2.2.2, the computational complexity of the
synthesis, mapping, placement, routing and power estimation framework for FPGAs is
such that it would be infeasible to perform any of these steps in the process of determining
capacitance values during word-length optimisation. Hence during word-length
optimisation it is essential to estimate inter-routing wire capacitance without performing
any of the above steps of the FPGA design flow.
The purpose of this chapter is to describe methods for estimating the capacitance
values of inter-routing wires, and to summarise simple methods for quickly combining
these capacitance estimates with signal activity values in order to estimate inter-routing
wire power consumption.
The main contributions of this chapter can thus be summarised as follows.
• A novel technique for utilising topological information on a circuit to estimate the
wire-length of its inter-routing wires (published in [CCC07]).
• A method for estimating the fan-out of inter-routing wires using only the word-
lengths and component types used in a system, and for estimating inter-routing
wire capacitance by using fan-out values in combination with wire-length values.
• Fast calculation of the total power consumption in inter-routing wires by combining
the above capacitance estimates with pre-calculated activity values.
Sections 5.1 and 5.2 which follow describe a set of methods for estimating the
capacitance of the inter-routing wires of a circuit, whilst Section 5.3 details a simple
technique for pre-calculating the activity values of inter-routing wires in order to provide
fast estimates of their power consumption. This chapter is then concluded with an analysis
of the accuracy of the proposed technique when estimating the inter-routing power of a
benchmark circuit.
5.1 Capacitance estimation
For capacitance estimation in routing wires, the more information available about the
circuit, the greater the accuracy of the power consumption estimate, but the greater the
time required to obtain this information. Figure 5.1 depicts the information available
about a circuit at the following stages of the design flow: i) before placement, ii) after
placement, and iii) after routing.
Due to the computational cost of placement and routing, it is essential that capacitance
estimates are made as early as possible in the FPGA design flow, preferably before
either of these steps is executed.
As described in Section 2.3.2, early-stage estimation of the capacitance of FPGA
routing wires has previously been considered in [FWAW05; AN04]. In [FWAW05] the
authors propose using fan-out alone after RTL synthesis as an estimate for capacitance, while
in [AN04] the authors consider post-placement estimation of capacitance using several
architecture-specific parameters known after placement. Each method has its drawbacks:
the fan-out only based method from [FWAW05] is likely to have a low accuracy as it is
unable to account for the placement of components, whilst the more complex post-placement
method in [AN04] is clearly unsuitable for use within word-length optimisation because it
requires time-consuming placement of a circuit.
[Figure 5.1 diagram: a netlist of Mult1, Mult2, Adder1 and Adder2 shown i) before placement, ii) after placement and iii) after routing.]
Figure 5.1: From left to right, the information available about a circuit that is relevant
to capacitance estimation, at various stages in the FPGA design flow. i) Before
placement, component types and inter-component connections are known. ii) After
placement, component location, net bounding box, and which LUT pins are used are
known. iii) After routing, the wire segments and switch box connections are known.
Nonetheless these methods provide important insight into the parameters that affect
the capacitance of a net in an FPGA and the accuracy that can be achieved when
estimating it. As a result we will study the effect of the parameters in both [FWAW05]
and [AN04] on wire capacitance in the Virtex II Pro device in the following section.
5.1.1 Parameters that affect net capacitance
As described in the preceding section, the dependence of net capacitance on Fan-Out (FO),
half perimeter Bounding Box (BB), and Wire-Length (WL), is well known in the research
community. In this section we will examine these parameters in more detail to understand
their contribution to net capacitance, and extract results from an FPGA to indicate the
accuracy that can be achieved with them when estimating capacitance. Figure 5.2 depicts
how these parameters relate to the layout of a wire, as described in greater detail below.
Fan-Out (FO) This is the number of pins (i.e. LUT inputs, etc.) a wire drives, e.g. four
in Figure 5.2. This is the only information available to predict capacitance before
placement is completed.
Figure 5.2: A net driven by component S, connected to two points in both components
D1 and D2 (Fan-Out of four). The dotted line is the Bounding Box (BB) of the net.
Wire-Length (WL) is the sum of the lengths of the net’s segments.
Half-perimeter Bounding Box (BB) This is half of the perimeter of an imaginary
box which bounds the wire. In this work, where Xilinx FPGAs have been used,
BB is measured in terms of the number of CLB tiles spanned by each wire, where
each CLB contains a total of eight LUTs and eight Flip-Flops and a switch box
connecting these to the routing fabric, as described in Section 3.2. Once the LUTs
a wire connects to have been placed, BB information is available.
Wire-Length (WL) This is the sum of the lengths of each segment of a wire, known
only once a wire has been routed.
The relationship between the above parameters and the capacitance of a net can
be explained as follows:
• Each LUT input that a net fans-out to has a fixed capacitance associated with it.
As all LUTs in an FPGA are uniformly constructed we can expect that the fan-out
of a net is linearly associated with its capacitance.
• Assuming capacitance per unit distance is constant both horizontally and vertically
through the routing network, then capacitance increases linearly with wire-length.
• Bounding-box is commonly used as a substitute for minimum wire-length at the
placement stage (i.e. when routing has not yet occurred) as it is much more easily
calculated. It is the true minimum wire-length for two and three-terminal nets, whilst
for nets with more terminals it serves as an approximation of minimum wire-length.
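The three parameters defined above can be computed directly from the terminals and segments of a net, as the following minimal Python sketch shows. It assumes that terminals are given as integer (column, row) CLB tile coordinates and that BB is measured as the sum of the coordinate ranges; whether the tile count should instead be inclusive (one more per dimension) is not stated here, so that choice is an assumption. The example net and its route are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (column, row) of a CLB tile; an assumed coordinate system


def fan_out(sink_pins: List[Point]) -> int:
    """Fan-out: number of pins the wire drives (one list entry per driven pin)."""
    return len(sink_pins)


def half_perimeter_bb(source: Point, sink_pins: List[Point]) -> int:
    """Half-perimeter of the smallest axis-aligned box enclosing all terminals."""
    xs = [source[0]] + [p[0] for p in sink_pins]
    ys = [source[1]] + [p[1] for p in sink_pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def wire_length(segments: List[Tuple[Point, Point]]) -> int:
    """Wire-length: sum of the Manhattan lengths of the routed segments."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in segments)


# Hypothetical net: source S drives two pins in D1 and two pins in D2.
source = (0, 0)
sinks = [(3, 1), (3, 1), (1, 4), (1, 4)]
# Hypothetical route, only known after routing has been performed.
route = [((0, 0), (1, 0)), ((1, 0), (1, 4)), ((1, 0), (3, 0)), ((3, 0), (3, 1))]

fo = fan_out(sinks)
bb = half_perimeter_bb(source, sinks)
wl = wire_length(route)
```

In this example the routed wire-length exceeds the half-perimeter bounding box, which is consistent with BB being only an approximation of minimum wire-length for nets with more than three terminals.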
In order to study the accuracy that can be achieved when estimating capacitance by
this set of wire parameters, a set of benchmark circuits has been selected and implemented
using Xilinx System Generator. Table 5.1 gives the names of the test systems, a short
description of their behaviour, their size in SLICEs, their number of components and
their numbers of inter-component and intra-component nets. The first six systems in
Table 5.1 are simple filters designed by hand,
while the last four are System Generator example systems. Each of these systems was
implemented on a Virtex II Pro device.
Four capacitance prediction models were created using the parameters discussed
so far, as follows: i) a pre-placement model using FO alone as in [FWAW05], ii) a post-placement
model using FO and BB, iii) the linear model (M8) proposed by Anderson and
Najm in [AN04] which uses FO, BB, and some architecture-specific parameters¹, and iv) a
post-routing model using FO and WL. Each model is a linear function of its parameters,
such as:
\[ C = \alpha \cdot \mathit{FO} + \beta \cdot \mathit{BB} \tag{5.2} \]
which estimates the capacitance $C$ of a wire with fan-out $\mathit{FO}$ and bounding box $\mathit{BB}$
(i.e. model ii), where $\alpha$ and $\beta$ are coefficients characterised for the model.
A set of coefficients is characterised for each of the above models using the 3605
inter-routing nets extracted from all of the benchmark systems in Table 5.1. The coefficients
of each prediction model are selected so as to minimise the Root Mean Squared
Relative Error (RMSRE) in capacitance over all the inter-routing wires, given by (5.3),
where $N$ is the number of wires and, for wire $i$, $\hat{c}_i$ is the capacitance estimated by the
model and $c_i$ is the capacitance measured using XPower. The RMSRE is minimised by
using weighted least squares regression, where the weight applied to each squared residual
$(\hat{c}_i - c_i)^2$ is $1/c_i^2$, the reciprocal of the square of the measured capacitance.
\[ \mathrm{RMSRE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{\hat{c}_i - c_i}{c_i} \right)^2} \tag{5.3} \]
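The characterisation step for the FO + BB model of (5.2) can be sketched in a few lines of Python. Dividing each residual by the measured capacitance, as (5.3) does, turns the fit into an ordinary least squares problem on scaled regressors, which for two coefficients can be solved in closed form. This is a minimal sketch under those assumptions, not the fitting code used to produce the results in this chapter; the function names and the data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_fo_bb(fo: Sequence[float], bb: Sequence[float], cap: Sequence[float]) -> Tuple[float, float]:
    """Fit C = alpha*FO + beta*BB by minimising the sum of squared relative errors."""
    # (alpha*FO_i + beta*BB_i - c_i) / c_i = alpha*(FO_i/c_i) + beta*(BB_i/c_i) - 1,
    # so regress a constant target of 1 on the scaled regressors.
    f = [fi / ci for fi, ci in zip(fo, cap)]
    b = [bi / ci for bi, ci in zip(bb, cap)]
    s_ff = sum(v * v for v in f)
    s_bb = sum(v * v for v in b)
    s_fb = sum(u * v for u, v in zip(f, b))
    s_f = sum(f)
    s_b = sum(b)
    det = s_ff * s_bb - s_fb * s_fb  # zero only if the scaled regressors are collinear
    alpha = (s_f * s_bb - s_fb * s_b) / det
    beta = (s_ff * s_b - s_fb * s_f) / det
    return alpha, beta


def rmsre(alpha: float, beta: float, fo: Sequence[float], bb: Sequence[float],
          cap: Sequence[float]) -> float:
    """Root mean squared relative error of the fitted model, as in (5.3)."""
    errors = [((alpha * fi + beta * bi - ci) / ci) ** 2 for fi, bi, ci in zip(fo, bb, cap)]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical wires whose capacitance is exactly 2*FO + 3*BB (arbitrary units).
fo_values = [1, 2, 4, 3]
bb_values = [2, 5, 3, 8]
cap_values = [8, 19, 17, 30]
alpha_fit, beta_fit = fit_fo_bb(fo_values, bb_values, cap_values)
fit_error = rmsre(alpha_fit, beta_fit, fo_values, bb_values, cap_values)
```

On this noise-free data the fit recovers the generating coefficients; on measured capacitances the residual RMSRE is the quantity compared between models.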
¹ The architecture-specific post-placement parameters used in [AN04] are counts of the F-LUT load
pins, G-LUT load pins, and CLB tiles containing at least one terminal.
Table 5.1: Benchmark Circuits
Name             Description                                      SLICEs  Components  Inter-comp. nets  Intra-comp. nets
IIR3             3rd order lowpass IIR filter                        389          19               248               697
SOS5             High-pass IIR filter in 5 second order sections    1448          72               593              3355
LMS2             2nd order LMS adaptive filter                       180          15                88               381
LMS4             4th order LMS adaptive filter                      1641          42               602              3787
FIR7             Symmetric 7th-order lowpass FIR filter              263          15               180               458
FIR31            Symmetric 31st-order bandpass FIR filter            672          50               510              1366
Color Converter  RGB to Y-Pr-Pb colour space converter               276          26               228               419
Fibonacci        Fibonacci sequence generator                        147          13               218               121
PolyEval         7th order polynomial evaluator                     1011          31               350              2482
Polyphase FIR    128-tap 1:8 polyphase FIR filter                    741          51               588              1341
[Figure 5.3 plots: (a) x-axis, capacitance model (FO, FO + BB, AN, FO + WL); y-axis, RMS Relative Error in capacitance (%). (b) x-axis, capacitance model (FO, FO + BB, AN, FO + WL); y-axis, Relative Error in capacitance (%).]
Figure 5.3: In (a), the RMSRE achieved when the methods shown were fitted to the
capacitance values from all the inter-routing wires extracted from the benchmarks
from Table 5.1 fitted to a Virtex II Pro device. In (b), a box and whisker plot for
each method of estimating capacitance shown in (a), where the whiskers show the
minimum and maximum Relative Errors in inter-routing wire capacitance for each
method, and the boxes show the upper and lower quartiles and median Relative
Errors in inter-routing wire capacitance for each method.
In Figure 5.3(a) the RMSRE for each capacitance estimation model, characterised
to the Virtex II Pro, is shown. It is clear that capacitance has a fairly low dependence on
Fan-Out alone, as the error measured when using this parameter to predict capacitance
is 39.2%. We can see that using FO with the BB or WL parameter gives a significant
improvement in accuracy, with the FO + BB model achieving an error of 25.7% and FO
+ WL achieving 22.6%. The AN model that includes FO, BB and several architecture-specific
parameters¹ has the highest accuracy, with an error of 17.0%. Figure 5.3(b) shows box and
whisker plots of the Relative Errors in capacitance estimates made by the same methods as
in Figure 5.3(a), and shows that the medians, quartiles and worst case errors in estimated
capacitance for each method closely follow the trends exhibited in Figure 5.3(a).
Although it may seem surprising that wire-length does not provide a more significant
advantage over using bounding-box, our work in [CCC07] has shown that for Xilinx
FPGAs newer than the Virtex II Pro wire-length gives a much larger benefit for capacitance
models, and hence FO + WL substantially outperforms the AN model for newer
FPGA devices, as shown in Figure 5.4 for a Virtex 4 device. For earlier FPGAs such as
the Virtex II Pro the capacitance of a wire is more affected by the transistors within the
routing fabric, i.e. which paths through switch boxes are used, and which LUT inputs are
driven by a wire. Interconnect capacitance is, however, becoming more affected by the lengths
of the metal wires used, as metal dimensions are not shrinking at the same rate as logic,
in order to avoid the consequent impact on routing delay.
In general these results indicate that knowledge of the Bounding Box of a net
gives significant improvements in the accuracy of capacitance estimates compared to using
Fan-Out alone. Additionally the results shown in Figure 5.4 and [CCC07] indicate that
wire-length (and hence bounding box) has a more significant effect on capacitance in
newer devices such as the Virtex 4 (and others as shown in [CCC07]), and hence using
bounding box as a method for predicting capacitance will continue to be a useful tool
for newer devices. However Bounding Box is only known after placement, which is too
computationally expensive to execute within word-length optimisation. In the following
section a method that we have developed for estimating the bounding box of the inter-component
routing wires in a system is described that can be performed quickly using
only information available before placement.
[Figure 5.4 plot: x-axis, capacitance model (FO, FO + BB, AN, FO + WL); y-axis, RMS Relative Error in capacitance (%), 0 to 60.]
Figure 5.4: The RMSRE achieved when the methods shown were fitted to the capacitance
values from all the inter-routing wires extracted from the benchmarks from
Table 5.1, when the benchmarks were fitted to a Virtex 4 XC4VLX40 device.
5.1.2 Enhancing early capacitance estimates
In this section we introduce a novel technique for estimating the bounding boxes of inter-
routing wires in a system before RTL synthesis, given a high-level description of the
system. Let us first review the information regarding the circuit that is available to us at
this stage.
1. From high-level descriptions of a design, such as the System Generator block diagrams
used in this work, the types of blocks used (arithmetic operations, etc.), and
the topology of the circuit, i.e. how these blocks are connected together, are known.
2. During each evaluation of a point in the word-length optimisation design space, power
consumption must be estimated for a system that has been assigned a specific set
of word-lengths. These word-lengths determine the number of output inter-routing
wires for each component, and, using the area models of arithmetic components
described in Section 3.2, the word-lengths can also be used to determine the area of
components in the system.
3. As adders and multipliers have a regular construction that is determined by input
word-length, it is also possible to calculate the number of LUTs that each input
of a block fans-out to within the block itself. This is described in more detail in
Section 5.2 which follows. As a result the fan-out of every inter-component routing
wire can be determined quickly using the word-lengths assigned to a system, without
resorting to RTL-synthesis.
Given the information available, bounding box estimates could be obtained by
floorplanning the components. However, this is a time-consuming operation, whereas we wish
to perform high-speed estimation.
Instead, by solving an approximation to the floorplanning problem and making
some assumptions on the shape of components in the circuit and the connections between
them, we can significantly reduce the computational complexity involved in estimating the
bounding box of inter-component routing wires. This will come at the cost of a drop in
accuracy however, which we shall quantify experimentally in the following section.
First of all, let us introduce several assumptions on the shape of the components in
the system and the connections between them. We will assume that all components are
squares, and that all connections between components depart from and arrive at the centre
of each component. Although these assumptions cause a loss in accuracy they significantly
simplify the floorplanning problem at hand.
Rather than performing true floorplanning where no components can overlap, we
will relax this constraint in our approximation of floorplanning. The removal of this
constraint makes the floorplanning problem considerably easier to solve.
Finally, we will assume that the inputs to a system are located at the bottom left
hand corner of the FPGA, and that outputs are at the opposite corner of the FPGA,
such that all components will be placed between these two points. As we have already
assumed that components are square-shaped with centrally located input/output points
and that components can overlap each other, it becomes clear that with this final assumption
on system I/O placement the resulting placement problem is symmetrical along the
line between the system inputs and outputs.
As a result, we can solve a single-dimension version of the two-dimensional floorplanning
problem, under the above assumptions. Overall some accuracy is lost due to
some of the assumptions made. However, we cannot expect to estimate the true placement
of components in a circuit without performing placement; instead our aim is to improve
Fan-Out only estimates of capacitance by extracting net Bounding-Box values from the
approximation of the floorplanning problem described above.
The remainder of this section will describe the specifics of an efficient method for
calculating the above described approximation, along with additional detail required to
account for the timing-driven nature of the placement problem.
As components are allowed to overlap we can consider them as points in a single
dimension stretched between the input point at position 0 representing all FPGA input
pins in the bottom left hand corner of the chip, and the output point at position C
representing all FPGA output pins in the top right corner of the chip. We then formulate
the following model for optimal placement over this interval.
Each component in our formulation has only one virtual output wire, representing
the connections made by all the output wires that the true component has in the placed
and routed circuit. The length of each wire is measured as the distance between the
left-most and right-most components the wire connects to, meaning that connections to
components between these points are made at no extra cost. The diagram in Figure 5.5
illustrates the way in which a netlist of components is stretched between the input and
output point in the problem formulation.
[Figure 5.5 diagram: i) a netlist of Mult1, Mult2, Adder1 and Adder2; ii) the same blocks after rough placement along a line between the Inputs point and the Outputs point.]
Figure 5.5: A simple netlist i) whose placement is approximated by the method
described in this section, resulting in a one-dimensional placement such as the one
depicted in ii).
Two Linear Programs (LPs) are solved in order to optimise the positions of components
and the wire-lengths in the problem formulation, so as to model timing-driven
placement. In the first LP the length of the longest wire is minimised. In the second LP
total wire-length is minimised, without exceeding the longest wire-length achieved by the
first LP. After the second LP is solved, the length of each wire is extracted and used as
an estimate of Bounding Box. As wire-length is closely correlated to wire delay, solving
these two LPs allows us to approximate timing-driven placement. The formulations can
be summarised as follows.
Given a set of blocks $V$ where each block $v_j \in V$ drives one net
$n_j = \{v_j, v_a, v_b, v_c, \ldots\} \subseteq V$ (i.e. the net $n_j$ connects the blocks
$\{v_j, v_a, v_b, v_c, \ldots\} \subseteq V$), we define the following variables in both LPs for each block $v_j$:
$x_j$ The position of the block,
$L_j$ The length of the block’s output net $n_j$,
$\mathit{max}_j$ The position of the rightmost block connected to $n_j$,
$\mathit{min}_j$ The position of the leftmost block connected to $n_j$.
Additionally the block representing the input point has position $x_{in} = 0$ while the block
representing the output point has position $x_{out} = C$. Any net $n_j$ that connects to the
output point has $\mathit{max}_j = C$. Each output net $n_j$ connects to all the components connected
to by all output bits of the component represented by $v_j$.
The value of $C$ is determined for a benchmark using the half-perimeter of the square
whose area is equal to the sum of the area of all blocks in the benchmark, calculated by:
\[ C = 2 \sqrt{\sum_{v_j \in V} \mathit{Area}_j} \tag{5.4} \]
where $\mathit{Area}_j$ returns the area in CLB tiles of the block $v_j$ in the benchmark.
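In both LPs the following constraints are defined for each block $v_j$ which has output
net $n_j$, in order to correctly calculate each net length $L_j$:
\[
\left.
\begin{aligned}
L_j &= \mathit{max}_j - \mathit{min}_j \\
\mathit{max}_j &\ge x_i \quad \forall v_i \in n_j \\
\mathit{min}_j &\le x_i \quad \forall v_i \in n_j
\end{aligned}
\right\} \quad j = 1, 2, \ldots, |V| \tag{5.5}
\]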
The strength of the 2D to 1D approximation is that we can additionally make use of the
component area information to enforce that the output net $n_j$ of each block $v_j$ is at least
as long as a minimum length $\mathit{minWL}_j$, equal to the sum of the widths of the blocks the net
connects to:
\[ L_j \ge \mathit{minWL}_j \quad \text{where} \quad \mathit{minWL}_j = \sum_{v_i \in n_j} \sqrt{\mathit{Area}_i} \tag{5.6} \]
under the assumption that each block occupies a square area. Note, however, that this is
a lower bound, and each net $n_j$ may be stretched to a longer length than $\mathit{minWL}_j$ due to
the stretching of the circuit between 0 and $C$.
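The two area-derived constants of the formulation, the interval length of (5.4) and the per-net lower bound of (5.6), follow directly from the block areas. A minimal Python sketch is given below; the block names and areas are hypothetical, and areas are assumed to be in CLB tiles.

```python
import math
from typing import Dict, Iterable


def interval_length(areas: Dict[str, float]) -> float:
    """C from (5.4): half-perimeter of the square whose area equals the total block area."""
    return 2.0 * math.sqrt(sum(areas.values()))


def min_wire_length(net: Iterable[str], areas: Dict[str, float]) -> float:
    """minWL_j from (5.6): sum of the side lengths of the square blocks on the net."""
    return sum(math.sqrt(areas[block]) for block in net)


# Hypothetical block areas in CLB tiles.
block_areas = {"Mult1": 16.0, "Mult2": 16.0, "Adder1": 4.0}
c_interval = interval_length(block_areas)
# Net driven by Mult1 that connects to Adder1 (the driver is part of its own net).
mult1_min_wl = min_wire_length(["Mult1", "Adder1"], block_areas)
```

Both quantities depend only on the word-lengths through the area models, so they can be recomputed cheaply for every candidate word-length assignment.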
The objective of the first LP is to minimise worst case wire-length:
\[ \min L_{\max} \quad \text{where} \quad L_{\max} \ge L_j, \; \forall v_j \in V \tag{5.7} \]
The worst case wire-length $L_{\max}$ achieved in the first LP is then used to form the following
constraint in the second LP:
\[ L_j \le L_{\max} \quad \forall v_j \in V \tag{5.8} \]
which constrains the worst case wire-length in the second LP to be the same as that
achieved in the first. The second LP then minimises total wire-length by using the objective:
\[ \min \sum_{v_j \in V} L_j \tag{5.9} \]
The number of variables in the LP formulations is $\Theta(|V|)$ while the number of
constraints in the formulations is $\Theta(w)$ where $w$ is the number of net source-to-sink pairs:
\[ w = \sum_{v_j \in V} (|n_j| - 1) \quad \text{so} \quad w < |V|^2 \tag{5.10} \]
As a result our LP formulations are $O(|V|^2)$ to solve optimally, on average [Bor87], and
so scale well with problem size.
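Solving (5.7) and then (5.9) subject to (5.8) amounts to a lexicographic minimisation: worst case wire-length first, total wire-length second. The Python sketch below illustrates that two-stage objective on a tiny hypothetical netlist. It is not an LP solver: it is an exhaustive search over integer positions, used here only as a stand-in to make the objective concrete, and it assumes that at the optimum each net length equals the larger of its span and its lower bound from (5.6). The netlist, areas and function names are all hypothetical.

```python
import itertools
import math
from typing import Dict, List, Tuple

# Hypothetical block areas in CLB tiles (the I/O points have zero area).
AREAS: Dict[str, float] = {"in": 0.0, "Mult1": 4.0, "Mult2": 4.0, "Adder1": 1.0, "out": 0.0}
# One net per driving block: the driver first, then the blocks it connects to.
NETS: Dict[str, List[str]] = {
    "in": ["in", "Mult1", "Mult2"],
    "Mult1": ["Mult1", "Adder1"],
    "Mult2": ["Mult2", "Adder1"],
    "Adder1": ["Adder1", "out"],
}
C = 2.0 * math.sqrt(sum(AREAS.values()))  # interval length from (5.4)
FREE = ["Mult1", "Mult2", "Adder1"]       # blocks whose positions are optimised


def net_lengths(x: Dict[str, float]) -> Dict[str, float]:
    """Length of each net: its span, but never less than minWL_j from (5.6)."""
    lengths = {}
    for driver, blocks in NETS.items():
        span = max(x[b] for b in blocks) - min(x[b] for b in blocks)
        min_wl = sum(math.sqrt(AREAS[b]) for b in blocks)
        lengths[driver] = max(span, min_wl)
    return lengths


def place() -> Tuple[Dict[str, float], float, float]:
    """Lexicographically minimise (longest net, total net length) over integer positions."""
    best = None
    for combo in itertools.product(range(int(C) + 1), repeat=len(FREE)):
        x = {"in": 0.0, "out": C, **dict(zip(FREE, map(float, combo)))}
        lengths = net_lengths(x)
        key = (max(lengths.values()), sum(lengths.values()))
        if best is None or key < best[0]:
            best = (key, x)
    (l_max, total), positions = best
    return positions, l_max, total


positions, l_max, total = place()
bb_estimates = net_lengths(positions)  # per-net lengths used as Bounding Box estimates
```

For this example the search returns a longest net of 4 and a total length of 11, both of which coincide with the lower bounds implied by (5.6). The exhaustive search grows exponentially with the number of blocks, which is why an LP solver applied to (5.5) to (5.9) would be used in practice.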
In the following section the accuracy and computational effort achieved when using
this method to predict capacitance is compared to that achieved when using estimates
made at various stages in the design flow.
5.1.3 Results
This section compares the accuracy and computation times of the method of capacitance
estimation introduced in Section 5.1.2 against the models based on previous work that were
described in Section 5.1.1.
During word-length optimisation of a circuit, capacitance estimates will be required
for many versions of the same circuit, where each version has been assigned a different set
of word-lengths. It is important to note that even slight differences in the netlists
passed to place and route can cause significant variations in net capacitance because, as has
been mentioned earlier, these steps of the FPGA design flow utilise stochastic optimisation
techniques such as simulated annealing that behave differently for every netlist. Indeed in
[AN04] the authors demonstrated that simply by reversing the order of nets in a netlist,
the nature of the Xilinx FPGA place and route tools caused an average of 22% variation
in net capacitance.
In the process of evaluating the performance of the capacitance model developed it
would be preferable to be able to account for these inherent variations in capacitance from
placement to placement. This has been achieved by generating five alternative placements
of each of the benchmarks in Table 5.1. The capacitance estimation models that are
compared below are characterised to the inter-routing wires extracted from one of the five
placements of all of the benchmark systems.
To account for the unpredictable nature of place and route across slightly different
versions of the same circuit, the characterised models are used to predict the capacitance
in the remaining four versions of each system. These four alternative versions of each
system will allow us to model the variation in capacitance due to place and route that
occurs for every change in word-lengths made to a system. By using alternative versions
of each circuit, each using the same set of word-lengths, we are able to isolate the variations
due to place and route alone.
We compare the capacitance estimation models below:
FO A linear model using Fan-Out alone as in [FWAW05].
FO + est. BB A linear model using Fan-Out and the Bounding Box estimate for each
component’s output, given by the proposed method.
FO + BB A linear model using Fan-Out and the BB value of each inter-routing, mea-
sured post-placement. As the proposed approach attempts to estimate BB this will
provide a bound on the accuracy achievable by the proposed method.
AN Anderson & Najm’s post-placement model (M8) [AN04].
PAR The capacitance values from XPower extracted from the placement using the first
random seed.
Figure 5.6(a) summarises the accuracy achieved by each of these estimation methods.
We can see that the proposed method reduces the error of capacitance estimates
by 3.1% over using a Fan-Out alone capacitance model. There is room for improvement
between the proposed method and the FO + BB model, which gives the accuracy we
could expect to achieve if perfect estimates of bounding-box were possible; however, it is
clear that without performing placement, getting significantly closer to this level of accuracy
would be difficult. The post-placement estimate given by AN and the post-route
estimates given by PAR give significant improvements in accuracy, with the post-route
estimates giving average errors of 19.3%. The error in the post-route estimates indicates
the inherent variability in capacitance of a circuit: after making any high-level change to
a circuit, the capacitance of the circuit’s inter-routing wires will vary by this amount, on
average.
[Figure 5.6(a): plot. Horizontal axis: Capacitance model (FO, FO+est. BB, FO+BB, AN, PAR). Vertical axis: RMS Relative Error when estimating capacitance in alternative placements (%), scale 0 to 50.]
[Figure 5.6(b): box and whisker plot. Horizontal axis: Capacitance model (FO, FO + est. BB, FO + av. BB, AN, PAR). Vertical axis: Relative Error when estimating capacitance in alternative placements (%), scale 0 to 300.]
Figure 5.6: In (a), the RMSRE in capacitance for the inter-routing wires from Table 5.1,
averaged across 4 alternative placements of each circuit, using the estimation
techniques listed above. In (b), a box and whisker plot for each method of estimating
capacitance shown in (a), each plot showing the Relative Error in capacitance for the
inter-routing wires from Table 5.1 averaged across 4 alternative placements of each
circuit, where the whiskers show the minimum and maximum Relative Errors and
the boxes show the upper and lower quartiles and median Relative Errors for each
method.
[Figure 5.7: plot. Horizontal axis: Capacitance model (FO, FO+est. BB, FO+BB, AN, PAR). Vertical axis: Spearman’s rank correlation coefficient averaged over each placement of each benchmark, scale 0 to 1. Annotated corresponding P-values: 8.4e−3, 1.3e−5, 7.2e−11, 3.6e−21.]
Figure 5.7: The Spearman’s rank correlation coefficients between the capacitances of
the wires in each benchmark and the estimates made by the models shown. The corresponding
P-values indicate the probability of achieving a particular level of correlation
using a randomly selected ordering of nets.
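The RMSRE metric reported in Figure 5.6(a) can be sketched as follows. The definition used here, the square root of the mean of the squared relative errors, is an assumption based on the name of the metric, since the formula itself is not restated in this section; the function name is hypothetical.

```python
import math

def rmsre(estimates, measured):
    """Root-mean-square relative error between two equal-length lists, in percent."""
    rel_sq = [((e - m) / m) ** 2 for e, m in zip(estimates, measured)]
    return 100.0 * math.sqrt(sum(rel_sq) / len(rel_sq))
```

For example, `rmsre(model_estimates, placement_capacitances)` would be evaluated once per alternative placement and the results averaged, where both argument names are placeholders for per-net capacitance lists.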
Figure 5.6(b) shows a box and whisker plot of the relative errors in capacitance
estimates made by each method shown in Figure 5.6(a). In Figure 5.6(b) it can be seen
that the medians and quartiles of the relative errors of each method show similar trends
to the RMSRE values seen in Figure 5.6(a). It can also be seen that some large errors occur
in the upper quartile for every technique including post route estimates, indicating that
the capacitance of some nets is greatly affected by the variability due to place and route.
Despite the modest improvement in the RMSRE in capacitance, the method also
allows much more accurate identification of those inter-routing wires in a benchmark which
have the highest capacitance. In Figure 5.7, we have used Spearman’s rank correlation
coefficient to measure the similarity of the ordering of net capacitance estimated by each
model compared to the ordering in the four alternative placements of each benchmark. A
correlation coefficient of +1 indicates that two rankings are identical, while a coefficient
of 0 indicates that there is no correlation between the two rankings.
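As an illustration of the ranking comparison described above, Spearman's coefficient can be computed as the Pearson correlation of the two rank vectors. This is a minimal sketch with hypothetical function names; it gives tied values the mean of their ranks and does not handle the degenerate case in which all values in a list are equal.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the capacitance estimates of a model and `ys` the capacitances extracted from one alternative placement, with one entry per net.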
We can see that the rank correlation coefficient has been improved significantly by
almost 0.3 over the FO model and gives rank orderings of a similar level of accuracy to
those given by the more computationally expensive models. Clearly the proposed method
can more accurately identify high capacitance nets at an early stage, allowing these to be
targeted during word-length optimisation.
Further improvements in accuracy can be achieved by using the post-placement
model AN, or by routing a circuit as in PAR, however these methods are associated with
much longer computation times, as shall now be demonstrated.
For one of the medium-sized benchmarks studied, PolyphaseFIR, the proposed
method took 0.58 seconds to calculate capacitance estimates, while generation, synthesis,
mapping and placement, required to use the method AN, took 161s. Routing took a
further 795s, and is needed to use the PAR model. For this system, the proposed ca-
pacitance estimation method is 300 times faster than a post-placement method and 2000
times faster than a post-routing method. The proposed method successfully trades off a
reduction in accuracy for a large reduction in the computational complexity by calculating
capacitance estimates before RTL-synthesis.
5.2 Fan-out estimation
The capacitance model described in Section 5.1.2 demonstrates a method for estimating
the bounding box for each wire in the system using a ‘rough placement’ of the system, and
for using these values in conjunction with the fan-out values known for each wire. Whilst
the method for performing a ‘rough placement’ of a system was described in Section 5.1.2,
it was assumed that due to the regular construction of arithmetic components wire fan-
out values can be easily determined by using the word-length of components assigned to
a system. In this section we will detail how this is possible.
Figure 5.8 indicates the approach taken to estimate the fan-out of a wire: the
component (or components) that the wire connects to are examined individually and the
number of points that the wire fans-out to within each component is then determined,
using knowledge of the component’s internal construction, and its input word-lengths.
[Figure 5.8: diagram of a wire from Adder1 fanning out to Mult1, Mult2 and Adder2.]
Figure 5.8: The diagram depicts a wire that fans-out from Adder1 to three other
components. The number of fan-outs for each bit of the wire that are within Adder2,
Mult1 and Mult2 can be determined for each component individually according to the
structure of the component and its input word-lengths, as is demonstrated in this
section.
The construction of the arithmetic components in question is such that the input
fan-outs of a component are very easily determined, as detailed in the following sections
for adders and multipliers implemented in both LUTs and embedded multipliers. Once the
determination of the fan-out for the inputs to individual components has been explained,
the calculation of the fan-out to multiple components will be dealt with in Section 5.2.4.
5.2.1 Input fan-out for adders
From the construction of adders, as described in Section 3.2.1, we can determine that each
input bit to an adder is used in only one full-adder in order to compute the result of an
addition, and so has a fan-out of one. However, in some cases the MSB of one of the inputs
to an adder may need to be sign-extended, due to different scalings of the adder’s inputs.
In this case the MSB of the signal with smaller scaling will be sign-extended by $p_l - p_s$ bits,
where $p_l$ and $p_s$ are the scalings of the input signal with larger and smaller scaling,
respectively, and the fan-out of that MSB increases accordingly.
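A minimal sketch of this rule is given below. It assumes that every sign-extension bit adds exactly one extra full-adder load to the MSB, which is an interpretation of the description above rather than something stated explicitly; the function and parameter names are hypothetical.

```python
def adder_input_fanouts(word_length, scaling, other_scaling):
    """Per-bit fan-out (index 0 = MSB) of one adder input.

    Each bit feeds one full-adder. If this input has the smaller scaling,
    its MSB is assumed to drive (other_scaling - scaling) additional
    full-adders because of sign extension.
    """
    fanouts = [1] * word_length
    extension = max(0, other_scaling - scaling)
    fanouts[0] += extension
    return fanouts
```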
5.2.2 Input fan-out for multipliers implemented in LUTs
When multiplying a pair of two’s complement numbers, each bit of the multiplicand input
is multiplied by each bit of the multiplier input. Thus if the number of multiplicand bits
increases, the fan-out of the multiplier bits increases, and vice-versa.
If the word-lengths of the multiplier and multiplicand are n and m bits, respectively,
each multiplicand bit is multiplied by n multiplier bits, and hence each multiplier bit is
multiplied by m multiplicand bits. Hence the fan-out of each multiplier bit is the number
of multiplicand bits m, and the fan-out of each multiplicand bit is the number of multiplier
bits n.
5.2.3 Input fan-out for embedded multipliers
For embedded multipliers the situation is similar to multipliers implemented in LUTs,
however every configurable routing wire fans-out to one pin on the embedded multiplier
block, which is then connected internally (within the multiplier block) to the embedded
multiplier logic.
Hence for multipliers of input word-lengths of up to 18 × 18 bits, each input has
only one fan-out. For larger multiplications, the number of embedded multiplier blocks
can be calculated as $\lceil n/18 \rceil \times \lceil m/18 \rceil$, where n is the multiplier word-length and m is the
multiplicand word-length. Extra embedded multiplier blocks due to the multiplier word-length
n cause the fan-out of the multiplicand signal to increase, whilst extra embedded
multiplier blocks due to the multiplicand word-length m cause the fan-out of the multiplier
signal to increase.
Thus the fan-out of the multiplier inputs to an embedded multiplier is given by
$\lceil m/18 \rceil$, whilst the fan-out of the multiplicand inputs to an embedded multiplier is given
by $\lceil n/18 \rceil$.
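The multiplier fan-out rules of Sections 5.2.2 and 5.2.3 can be collected into a few small functions. This is a sketch only; the function names are hypothetical, and the 18-bit block width is the one quoted above.

```python
import math

def lut_multiplier_fanout(n, m):
    """Return (multiplier-bit fan-out, multiplicand-bit fan-out) for an
    n-bit multiplier input and an m-bit multiplicand input built from LUTs."""
    return m, n

def embedded_block_count(n, m, block_width=18):
    """Number of embedded multiplier blocks needed for an n x m multiplication."""
    return math.ceil(n / block_width) * math.ceil(m / block_width)

def embedded_multiplier_fanout(n, m, block_width=18):
    """Return (multiplier-bit fan-out, multiplicand-bit fan-out) when
    embedded multiplier blocks are used."""
    return math.ceil(m / block_width), math.ceil(n / block_width)
```

For instance, a 24-bit multiplier input and a 16-bit multiplicand input need two blocks, so each multiplicand bit fans out to two pins while each multiplier bit fans out to one.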
5.2.4 Fan-out to multiple components
In multiple word-length systems such as those generated during word-length optimisation,
different word-lengths can be assigned to each branch of a signal that branches out to
several components as shown in Figure 5.9. Additional care is needed to calculate the
fan-out of each bit of such signals correctly.
[Figure 5.9: diagram of a component output branching to three components, with MSB and LSB labels on the signals.]
Figure 5.9: A component with an output that branches out to three other components
is shown. Different input word-lengths are used for each component, i.e. from top
to bottom word-lengths of 3, 2, and 1, respectively. According to the input word-lengths
used for each component, different output bits will branch out to different
components, which must be accounted for correctly as described in this section.
If we sort the branches of a multiple-branch signal in increasing order of word-length
and designate these word-lengths $w_1, w_2, \ldots, w_b$, where $b$ is the number of branches, then
bits 1 to $w_1$ of the signal (where bit 1 is the MSB) fan-out to all components connected to
by the multiple-branch signal. Bits $w_1 + 1$ to $w_2$ of the signal connect to all components
connected to by the multiple-branch signal, except the component that word-length $w_1$
is associated with. Similarly, bits $w_2 + 1$ to $w_3$ connect to all components, except the
components that word-lengths $w_1$ and $w_2$ are associated with, and so on for all other bits
up to $w_b$. The fan-out of output bits that branch out to multiple components can be
determined by summing the input fan-outs of each component connected to, where the
input fan-outs are calculated according to the arithmetic component type as described in
the preceding sections.
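The procedure above can be sketched as follows. As a simplification that is not part of the original description, each destination is assumed to present the same input fan-out to every bit it receives (so adder sign extension is ignored); the function name is hypothetical.

```python
def multi_branch_fanouts(branches):
    """Fan-out of each bit (index 0 = MSB) of a multiple-branch signal.

    branches: list of (word_length, input_fanout) pairs, one per destination
    component, where input_fanout is the per-bit fan-out inside that component.
    """
    longest = max(w for w, _ in branches)
    # Bit k (1 = MSB) reaches every branch whose word-length is at least k.
    return [sum(f for w, f in branches if w >= k)
            for k in range(1, longest + 1)]
```

With the three branches of Figure 5.9 (word-lengths 3, 2 and 1, each with an assumed input fan-out of one), the MSB drives all three components and the last bit drives only the widest branch.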
The fan-out of each bit of the inter-routing wires in a system can thus be determined
quickly using the word-lengths assigned to the system, allowing accurate capacitance esti-
mates to be made by using these fan-out values with the bounding box estimates provided
by the ‘rough placement’ method described in Section 5.1.2.
5.3 Activity estimation
The capacitance in routing wires forms only one part of the information required to esti-
mate the power consumed in those wires because, as was shown at the beginning of this
chapter, the dynamic power consumed in these wires is also determined by the activity of
the signals passing through them.
Fortunately, as we have assumed that the outputs of all components in the systems
we are considering for word-length optimisation will be registered, estimating the activity
in the inter-routing wires in these systems is straightforward. The word-level statistics of
component input signals that are used in Chapter 4 to estimate the power consumption
within components also allow the activity within those input signals to be predicted using
the DBT activity model, as is described in Section 4.1.
The most important thing to note about activity estimation for the inter-component
routing wires in a system is that the activity in all bits in a signal can be determined
once and for all, before optimisation, from word-level statistics gathered during high-level
simulation of a system. Assuming that the quantisation noise introduced into a system
during the process of word-length optimisation is small, the pre-determined activity in
each bit of the inter-component signals will not change due to the word-lengths selected
for each signal. Clearly, the word-length used for each signal affects which bits are used
in the signal, but it does not affect the activity of any of these bits.
The result of this is that the activity of every inter-routing bit that may potentially
be used in a system can be calculated once only, before word-length optimisation, and
stored in memory. During the process of word-length optimisation, the power consumed
in inter-routing wires can then be quickly calculated by multiplying the pre-computed
activity values of each signal by the capacitance of each signal determined using the method
described in Section 5.1.2.
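The calculation described above can be sketched as a lookup of pre-computed activities followed by a weighted sum. The expression used, $P = \frac{1}{2} V_{dd}^2 f \sum_i \alpha_i C_i$, is the usual textbook form of dynamic power and is an assumption here, because the exact expression from the beginning of the chapter is not restated in this section; all names are hypothetical.

```python
def routing_power(activity, capacitance, word_length, vdd, f_clk):
    """Dynamic power (W) in the routing of the first word_length bits of a signal.

    activity:    pre-computed transitions per clock cycle for every bit that
                 might be used (index 0 = MSB), stored before optimisation.
    capacitance: per-bit routing capacitance (F) estimated for the current
                 word-length assignment.
    """
    switched = sum(activity[b] * capacitance[b] for b in range(word_length))
    return 0.5 * vdd ** 2 * f_clk * switched
```

Only `word_length` and `capacitance` change between candidate word-length assignments; the `activity` list is reused unchanged, which is what makes the evaluation fast.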
5.4 Summary
This chapter has described a set of methods that, in conjunction, allow the power con-
sumption in the inter-component routing wires of a system to be estimated. By calculating
the fan-out of these wires and estimating their bounding-box using a technique for ap-
proximating placement, the capacitance in each wire can be determined, whilst activity is
determined using the word-level statistics used for activity estimation in the DBT activity
model.
The dependence of net capacitance on wire-length and bounding-box was shown for
both the Virtex II Pro device and the new Virtex 4 device, and as shown in [CCC07] these
results indicate that net capacitance in newer devices such as the Virtex 4 has a higher
dependence on wire-length. Hence the proposed bounding-box estimation technique will
be a useful tool for estimating inter-routing wire capacitance in both current and future
FPGAs.
The speed of the models developed compared to more accurate methods that require
placement or even routing of a circuit to estimate capacitance has been demonstrated, and
as the method developed for capacitance estimation alone is 2000 times faster than place
and route, the benefit is clear. Although the lower accuracy of the capacitance models has
been documented, the accuracy of the combined power consumption models has not been
stated. However, we expect little additional error to be introduced, because the
fan-out calculations are accurate and the activities of signals correlate closely with those
predicted by the DBT model.
Figure 5.10 compares estimates and XPower measurements for the inter-component
routing power consumed by the output of each component in the same IIR system organ-
ised in second order sections as was introduced in Section 4.6. A block diagram of this
system is included in Appendix A.
[Figure 5.10: horizontal bar chart of measured and estimated inter-routing power for each component output. Horizontal axis: Inter-routing power consumption of component’s output (W/MHz), scale 0 to 7 × 10^−5. Components: BiQuad1, BiQuad2 and BiQuad3 (each with AddSub0, AddSub1, AddSubB0, AddSubB1, DelayA1, DelayB1, DelayB2, FF Delay, MultA1, MultA2), In Delay, Out Delay, Scale Mult1, Scale Mult2, Scale Mult3.]
Figure 5.10: The inter-routing power consumption of the components in the system
SOS3 LP.
In Figure 5.10 it can be seen that the high-power consumption nets that branch
out to several components are clearly identified by the estimation method developed. The
accuracy of the power consumption estimates made for these nets is high. On the other
hand, we can see that there is a lower level of correlation between the estimates and the
measurements for nets for which lower power consumption was observed.
The reason why this occurs has been explained in more detail in the preceding sections
of this chapter but is once again summarised below to highlight the difficulties faced in
placement estimation in this chapter.
The capacitance of the routing wires in a system is inherently unpredictable, due to
the use of stochastic optimisation techniques to decide on the placement and routing of a
circuit. The variation in routing wires manifests itself at a particularly high level in short,
low-capacitance nets in the circuits studied because these nets are not timing critical, in
comparison to the longer nets in the systems studied that have high fan-out and hence
higher capacitance and longer delay. Because the proposed method does not perform
true placement before making capacitance estimates, it is unable to reflect the fine-grain
decisions that are made regarding the placement of the smaller components (such as adders
and delay registers) whose output is connected to only one other component.
Although this results in a higher percentage error in the capacitance estimates made
for the single branch single fan-out components in a system, this should not significantly
affect the results achieved during word-length optimisation. The reason for this is that
word-length optimisation will attempt to target high-power consumption points in a sys-
tem by reducing the word-lengths of the signals associated with these points. The rank
ordering results for capacitance estimation that were given in Section 5.1.3 demonstrate
that we are able to distinguish those parts of a circuit that have high-power consump-
tion accurately, and as such the methods that have been developed should be sufficient
to ensure appropriate signals are truncated in order to reduce power-consumption during
word-length optimisation.
In the following chapter we present the method developed for performing word-
length optimisation for power consumption along with results from its application to a
number of benchmark systems. We will also compare word-length optimisation for power
consumption minimisation to word-length optimisation techniques that minimise other
costs of a system, and as shall be seen the results from these comparisons will be used to
shed further light on the accuracy required from the models used to estimate the cost of
a system, such as those used to estimate dynamic power consumption presented in this
chapter and in Chapter 4.
Chapter 6
Word-length selection using constrained optimisation
This chapter describes a word-length selection technique that finds optimal solutions to
non integer word-length optimisation problems. We focus in this chapter on optimising
algorithms whose quantisation noise can be estimated using the perturbation analysis
method summarised in Section 2.1.2 and whose area or power consumption can be esti-
mated using the models described in Chapters 3, 4 and 5. The proposed method provides
tight lower and upper bounds to the optimal integer solution to a word-length optimisation
problem, with the upper bound providing a starting point for heuristics to find
better integer solutions.
The advantage of the proposed technique is that these lower and upper bounds
on the optimum integer cost can be established quickly, whilst finding the optimal integer
solution to a word-length optimisation problem is known to be NP-hard (see Section 2.1.1).
Existing heuristic methods cannot guarantee to find the optimal solution and provide no
way of assessing how ‘good’ a result is. In contrast the proposed method allows the
optimality of any particular integer solution to be assessed.
In the following sections the proposed word-length selection technique summarised
above is described in detail. Afterwards Section 6.3 compares the results of optimising
for different cost functions (area, power, etc.) and provides an analysis of the method’s
performance.
The main contributions of the work presented in this chapter are as follows:
• A technique for solving a relaxed version of the word-length optimisation problem
where integer word-lengths are not enforced, using convex optimisation techniques.
• A method for using the result of this “non integer” word-length optimisation problem
to establish tight bounds on the optimal solution of the equivalent integer problem.
• The introduction of constraints and modifications to the word-length optimisation
problem in order to ensure globally minimum solutions are found using convex op-
timisation.
• The first results that allow for detailed analysis and comparison of multiple word-
length optimisation techniques for uniform word-lengths, sum of word-lengths, sys-
tem area and system power consumption.
A journal paper based on the work described in this chapter has been submitted to the
ACM Transactions on Design Automation of Electronic Systems [CCC08].
6.1 Sequential Quadratic Programming
Sequential Quadratic Programming (SQP) is an optimisation method for finding a mini-
mum of a function f(x) as expressed in (6.1), where x is a vector and g(x) and h(x) are
constraint functions that may return a vector result. The method is most useful when at
least one of either f(x), g(x) or h(x) is non-linear, as otherwise the problem in (6.1) may
be better solved using other specialised techniques for linear programming. A good review
of SQP is given in [Bog95], and is summarised below.
$$\begin{aligned} \text{minimise} \quad & f(x) \\ \text{subject to:} \quad & g(x) = 0 \\ & h(x) \le 0 \end{aligned} \tag{6.1}$$
SQP is an iterative approach to finding a minimum $x^*$ of the function $f(x)$, where
the constraint functions are satisfied. At each iteration the Lagrangian function of the
optimisation problem (6.1) is approximated as a quadratic programming problem at the
current point $x_k$. The solution of this quadratic program is used to move to a point $x_{k+1}$,
closer to $x^*$. The Lagrangian function $L(x)$ is frequently used in constrained optimisation
to account for both the cost and constraint functions by penalising unsatisfied constraints;
further detail can be found in [Bog95].
In order to construct the quadratic subproblem at each iteration the first and
second order partial derivatives $L'(x)$ and $L''(x)$ of the Lagrangian at the current point $x_k$
must be known. Calculating the second order partial derivatives (i.e. the Hessian) of the
Lagrangian is usually computationally expensive, so instead an estimate of the Hessian is
updated after each SQP iteration using the first order partial derivatives (i.e. the gradients)
of $f(x)$, $g(x)$ and $h(x)$ with respect to $x$, by using the BFGS formula summarised in
[Bog95]. Calculating the first-order derivatives of the Lagrangian $L'(x)$ also requires the
first-order partial derivatives of the cost and constraint functions to be known, which can
be accomplished either by deriving the analytic equations to calculate these, or by using
the finite difference method on the cost and constraint functions to measure their gradient
at $x_k$.
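The finite difference option mentioned above can be sketched with a simple forward difference. The step size and the toy cost function are illustrative assumptions, not values or models taken from this work.

```python
def finite_difference_gradient(func, x, step=1e-6):
    """Forward-difference estimate of the gradient of func at x (a list of floats)."""
    base = func(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step            # perturb one variable at a time
        grad.append((func(shifted) - base) / step)
    return grad

def toy_cost(x):
    # Hypothetical cost: one multiplier (product of two word-lengths) plus one adder.
    return x[0] * x[1] + x[2]
```

Each gradient evaluation costs one extra call to the cost or constraint function per variable, which is why analytic derivatives are preferable when they are available.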
At each iteration, once the quadratic subproblem has been solved, a direction along
which to proceed is known. The distance to travel along this direction is determined by a
merit function that ensures global convergence of the SQP algorithm.
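For reference, one common textbook form of the quantities described in the preceding paragraphs is given below. The symbols $\lambda$, $\mu$ (multiplier vectors), $d$ (search direction), $B_k$ (Hessian estimate), $\alpha_k$ (step length), $s_k$ and $y_k$ are notation assumed here for illustration, and practical SQP implementations often use a damped variant of the BFGS update shown.

```latex
% Lagrangian of problem (6.1)
L(x, \lambda, \mu) = f(x) + \lambda^{T} g(x) + \mu^{T} h(x)

% Quadratic subproblem solved at the current point x_k
\begin{aligned}
\min_{d} \quad & \nabla f(x_k)^{T} d + \tfrac{1}{2}\, d^{T} B_k\, d \\
\text{subject to:} \quad & g(x_k) + \nabla g(x_k)^{T} d = 0 \\
                         & h(x_k) + \nabla h(x_k)^{T} d \le 0
\end{aligned}

% Step along the direction d_k, with \alpha_k chosen using the merit function
x_{k+1} = x_k + \alpha_k d_k

% BFGS update of the Hessian estimate
B_{k+1} = B_k + \frac{y_k y_k^{T}}{y_k^{T} s_k}
              - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k},
\qquad s_k = x_{k+1} - x_k,
\quad y_k = \nabla_x L(x_{k+1}, \lambda, \mu) - \nabla_x L(x_k, \lambda, \mu)
```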
Given a starting point x0 SQP will find a minimum value of f(x) that satisfies the
given constraints. If either f(x) or the constraint functions are not convex the minimum
found by SQP may only be a local minimum, however if they are convex the global
minimum of the problem will be found.
Word-length optimisation problems are not solvable in their standard form by SQP,
for a variety of reasons that are listed in the following section. It has however been
possible to slightly modify the word-length optimisation problem in order to allow an
approximation of it to be solved quickly and optimally by SQP. These modifications are
also detailed in the following section and as will be seen they only marginally affect the
optimal solution found whilst allowing bounds on the minimum cost function value to be
calculated.
6.2 Word-length optimisation modifications for SQP
To use SQP to find solutions for word-length optimisation we must formulate our problem
as shown in (6.1) so that x is a vector representing the signal word-lengths in the system,
f(x) will be a cost function that is to be minimised, e.g. area or power, and h(x) ≤ 0 will
be used to express a constraint on the noise variance at the output of the circuit. If the
circuit has multiple outputs there will be one noise constraint for each output, however
without loss of generality we assume for the remainder of this chapter that there is only
one system output and hence only one noise constraint. Later we will see that further non-
linear and some linear inequality constraints will be needed to allow correct and optimal
solutions to the word-length optimisation problem to be found. For now however let us
assume that h(x) ≤ 0 will represent the noise constraint on the system. No equality
constraints (of the form g(x) = 0) are required for word-length optimisation.
It is clear from the models presented in previous chapters that area and power
consumption are non-linear in terms of x. This is due to the multipliers in a system,
whose area and power is quadratically related to word-length when LUTs are used, and
highly non-linear when embedded multipliers are used as shown in Chapters 3 (for area)
and 4 and 5 (for power). Additionally, the routing power model described in Chapter 5
uses capacitance values that are returned by a linear program whose behaviour as word-
lengths change is likely to be non-linear. Finally the truncation noise function is clearly
non-linear as it is a sum of exponential terms of the form shown in Section 2.1.2.
Although the non-linear nature of these functions indicates that SQP may be a
suitable candidate for solving word-length optimisation, as mentioned earlier there are
several reasons why this is not possible. These reasons are listed below and dealt with in
the following subsections.
A) Word-length optimisation is an integer variable problem. SQP however
deals only with real variables. Neither the steps made in x at each iteration nor the
solution found are necessarily integral, but word-lengths must be integer values.
B) The noise function is not convex. Correlation between truncation noise injected
into the fan out branches of a signal that fans out to multiple destinations can cause
non-convexity when the word-lengths of the fan out branches change, as explained
in [CCL02].
C) Preventing zero-padding of signals. If more bits of precision are chosen to rep-
resent a signal than are available from the preceding operation then extra bits are
zero padded. These add no accuracy to a signal and require unnecessary effort to
keep track of.
D) Routing power convexity. The capacitance estimates used when estimating inter-
component routing power are calculated using a linear program whose behaviour as
word-lengths change is non-convex.
6.2.1 Relaxation of integer constraints
At first glance SQP would seem to be an unsuitable optimisation method for word-length
optimisation as the word-lengths of the signals in a system must be integer values. However
as detailed in Section 2.1.1 optimisation problems involving integer variables are typically
NP-hard and word-length optimisation is no exception [CW02]. As a result only the
smallest of word-length optimisation problems have been solved to optimality [CCL02].
The remainder of this section concentrates on efforts to make the word-length
optimisation problem optimally solvable when the constraint on integer word-lengths is
relaxed (we will call this the non integer problem from now on).
The fact that we may be able to optimally solve a non integer version of an integer
variable problem is not necessarily useful, but in the case of word-length optimisation this
does turn out to be the case. This is due, firstly, to the fact that the cost of the optimal
solution to the non integer word-length optimisation problem gives a lower-bound on the
cost that can be achieved by the integer problem.
Although it is possible that an integer solution is the optimal point for both the
integer and non integer word-length optimisation problems, it is far more likely that some
additional improvements in the cost function can be achieved over the integer solution
by taking advantage of the possibility of using fractional word-lengths. As the integer
version of word-length optimisation cannot be solved optimally for any but the smallest
of systems, having a lower bound on the cost of the problem is very useful as it allows the
effectiveness of heuristics for finding near-optimal solutions to the integer problem to be
studied.
Once an optimal solution to the non integer word-length optimisation problem has
been found, an upper bound on the optimal cost of the integer version of the problem can
be obtained by rounding each word-length up to the next integer. The rounded solution
has increased cost but better error performance, i.e. it is a feasible integer word-length
solution. When obtaining the upper bound to the integer word-length problem from the
optimal non integer word-length point, each word-length will increase by an average of
0.5 bits and by at most 1 bit, which is likely to lead to a relatively small increase in
the cost of the system. This is the second reason why finding the optimal solution of the
non integer word-length optimisation problem is useful: we are able to quickly find an
integer solution to the problem that lies close to the optimal point, and that solution
gives an upper bound on the cost of the integer version of the problem.
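The bounding argument can be illustrated on a toy problem. The sketch below is not taken from this work: it assumes a hypothetical two-signal system with cost 2 wl_x + wl_y and noise constraint 4^(-wl_x) + 4^(-wl_y) ≤ 4^(-14), for which the non integer optimum has a closed form obtained from the stationarity conditions. The names `cost`, `feasible` and `relaxed_optimum` are illustrative only.

```python
import math

K = 14             # assumed noise budget: 4**-wl_x + 4**-wl_y <= 4**-K
CX, CY = 2.0, 1.0  # assumed per-bit costs of the two signals


def cost(wl_x, wl_y):
    return CX * wl_x + CY * wl_y


def feasible(wl_x, wl_y):
    return 4.0 ** -wl_x + 4.0 ** -wl_y <= 4.0 ** -K


def relaxed_optimum():
    # On the constraint boundary, stationarity gives
    # 4**-wl_x : 4**-wl_y = CX : CY.
    total = CX + CY
    return K + math.log(total / CX, 4), K + math.log(total / CY, 4)


wl_relaxed = relaxed_optimum()
lower_bound = cost(*wl_relaxed)                        # non integer optimum
wl_rounded = tuple(math.ceil(w) for w in wl_relaxed)   # round every word-length up
upper_bound = cost(*wl_rounded)

# Exhaustive search of the (tiny) integer problem, for comparison only.
integer_optimum = min(
    cost(x, y) for x in range(1, 25) for y in range(1, 25) if feasible(x, y)
)

print(lower_bound, integer_optimum, upper_bound)
```

Rounding up can only reduce truncation noise, so the rounded point stays feasible, and the true integer optimum is sandwiched between the two bounds.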
The upper bound word-lengths can then be treated as a starting point for heuristics
for solving the integer version of the word-length optimisation problem. A variety of
heuristics have been presented in previous work, but as word-length optimisation is
NP-hard none of them is guaranteed to find an optimal solution. For this work, where lower
and upper bounds on the cost function are established prior to the execution of a heuristic,
we envision the use of heuristics that continue to reduce the cost of a circuit
until it becomes acceptably close to the known lower bound, though these have not yet
been developed in the course of this work. Currently the heuristic described in [CCL01]
is used, which iteratively reduces word-lengths as follows:
1. Starting from the current assigned word-lengths, take each signal in turn and
gradually reduce its word-length until just before noise constraints are broken, at which
point measure the cost of the system before returning the signal’s word-length to
the value assigned to it at the start of this iteration.
2. Once all signals have been tested, choose the signal whose word-length reduction
resulted in the largest measured decrease in cost and reduce its word-length
permanently by one bit, then return to (1) and start a new iteration.
3. Terminate when no further word-length reductions are possible without breaking
signal to noise ratio constraints.
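The three steps above can be sketched as follows. This is a minimal illustration rather than the implementation of [CCL01]: the function name `greedy_reduce`, the callables `cost` and `feasible`, and the toy noise and cost models used to exercise it are all assumptions made for the example.

```python
def greedy_reduce(word_lengths, cost, feasible):
    """Greedy integer word-length reduction following the three steps above.

    word_lengths: dict mapping signal name -> integer word-length (feasible start).
    cost:         callable(dict) -> float, e.g. an area or power model.
    feasible:     callable(dict) -> bool, True while noise constraints are met.
    """
    wl = dict(word_lengths)
    while True:
        base_cost = cost(wl)
        best = None  # (signal, measured decrease in cost)
        for signal in wl:
            trial = dict(wl)
            # Step 1: shrink this signal alone until just before infeasibility.
            while trial[signal] > 1:
                trial[signal] -= 1
                if not feasible(trial):
                    trial[signal] += 1
                    break
            if trial[signal] == wl[signal]:
                continue  # this signal cannot be reduced at all
            decrease = base_cost - cost(trial)
            if best is None or decrease > best[1]:
                best = (signal, decrease)
        if best is None:
            return wl  # Step 3: no further reduction is possible
        wl[best[0]] -= 1  # Step 2: commit a single bit, then iterate


# Toy models (assumed): two signals sharing one noise budget.
def toy_feasible(wl):
    return sum(4.0 ** -bits for bits in wl.values()) <= 4.0 ** -14


def toy_cost(wl):
    return 2 * wl["x"] + wl["y"]


result = greedy_reduce({"x": 16, "y": 16}, toy_cost, toy_feasible)
print(result)
```

Each outer iteration removes exactly one bit from one signal, so the loop terminates after at most as many iterations as there are bits above the one-bit floor.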
Solving the word-length optimisation problem with relaxed integer constraints
clearly provides significant benefits when trying to find the optimal solution of the
integer version of the problem. However, optimal solutions to the non integer word-length
optimisation problem are not directly found by SQP, as was described earlier. The
following subsections address the convexity issues that must be solved to ensure global
optima are found.
6.2.2 Simplification of noise model to convex form
Where a signal fans out to multiple components, one word-length variable is allocated
for each fan-out branch of the signal, allowing individual control over the cost of each
component driven by the signal. This means that each fan-out branch can be a
different truncated version of the original signal, and as a result branches that share some
number of truncated bits share the same truncation noise contribution from the removal
of these bits. Any noise injected into several fan-out branches due to some shared set
of truncated bits is correlated, i.e. it is the same noise signal. When this correlated
noise re-converges in later operations of an algorithm it may either grow larger or cancel
out, depending on the intervening operations performed on the correlated noise signals.
Accounting for truncation noise correlation is thus an important part of noise modelling.
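[Figure 6.1 graphic. (a) Quantisers Q in cascade truncate wl0 to wl1, wl2 and wl3, injecting N1, N2 and N3, which reach the output through H1 + H2 + H3, H2 + H3 and H3 respectively. (b) Separate quantisers Q truncate wl0 to wl1, wl2 and wl3, injecting N1, N2 and N3, which reach the output through H1 + H2 + H3, H2 and H3 respectively.]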
Figure 6.1: Noise injection schemes for truncating a signal with three fan-out branches
from word-length wl0 to word-lengths wl1, wl2 and wl3. The quantiser blocks Q inject
noise signals N1, N2 and N3 into the branches of the signal as a result of truncation.
These noise signals pass through combinations of the transfer functions H1, H2 and
H3, to the system output. Scheme (a) correctly calculates the noise injected when
wl1 > wl2 > wl3 but non-convexity can occur if this order of word-lengths changes,
as the destination transfer functions also change. Scheme (b) only accounts for the
truncation noise common to H1, H2 and H3, but the optimal ordering of word-lengths
can be determined to make the scheme convex with the enforcement of appropriate
constraints as described below.
The noise model in [CCL02] correctly models the correlated noise injected into
fan-out branches by using the noise injection scheme depicted in Figure 6.1(a). In this scheme,
the fan-out branches of a multiple fan-out signal are ordered from longest to shortest
word-length with each branch ‘tapped off’ from the original source in this order. Noise is injected
before each tap in order to model truncation from the word-length $w_n$ before the tap to
the next (shorter) word-length $w_{n+1}$ after the tap. Note that the noise injected due to
truncation from the original signal word-length to the longest word-length is common to all
subsequent fan-out branches, and that subsequent injected noise is common to a shrinking
subset of the fan-out branches.
If a change in word-length results in a re-ordering of the fan-out branches (because
these must be kept in decreasing order of word-length) then the subsets of branches into
which noise is injected change. The new subsets of branches (after re-ordering) cause
a different response at the system output to the truncation noise injected into each tap.
Hence a decrease in word-length that causes a re-ordering of fan-out branches can cause
a decrease in noise at the system output; this makes the system output noise non-convex
for certain changes in fan-out branch word-length.
Unfortunately there is no known method to determine the ‘ideal’ order of word-
lengths in the branches of a multiple output signal before or during optimisation. A better
ordering of fan-out word-lengths than the current one in use may always exist unless a
search of the entire word-length optimisation space is performed.
In this work we have simplified the noise model by, for the most part, not accounting
for the effects of correlated noise injected into fan-out branches. Instead, we only
account for the correlated noise that is common to all the branches in a multiple fan-out
signal. Noise due to truncation beyond the amount common to all fan-out branches
is assumed to be uncorrelated.
The noise injection scheme corresponding to this simplified noise model is depicted
in Figure 6.1(b). Under the proposed scheme, correlated noise corresponding to that
created when truncating to the longest word-length branch is injected into all branches
of a multiple fan-out signal. Every branch whose word-length is shorter than the longest
word-length is injected with extra uncorrelated noise that accounts for truncation from
the longest word-length to the shorter word-length in that branch.
This scheme correctly models correlated truncation noise for multiple output signals
with two fan-out branches. As the number of fan-out branches increases past two, the
proposed model becomes less accurate. The advantage of the proposed model however is
that with appropriate constraints it can be made convex and so allow optimal solutions
to be found by SQP.
Two states can be identified for the word-length of each branch of a multiple fan-out
signal under the proposed noise injection scheme: for each branch either i) the word-length
of the branch is the longest in the fan-out, or, ii) the word-length of the branch is not
the longest word-length in the fan-out. For the same reasons as the noise model used in
[CCL02], non-convexity can occur when any word-length changes from state i) to state
ii), or vice versa.
However, in the case of the simplified model it is possible to determine the order of
word-lengths such that the optimal solution is always found. This is done by constraining
the word-length belonging to the fan-out branch that is most sensitive to truncation noise
to be larger than all other word-lengths in the fan-out. Shorter word-lengths can then be
assigned to the other branches that are less sensitive to truncation noise, giving a lower
system cost than could be achieved using other configurations.
To form these constraints a system that has m multiple fan-out signals formed from
a total of b individual fan-out branches requires b−m linear inequality constraints to be
included in the formulation of the word-length optimisation problem.
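As an illustration of how the b − m constraints arise, the sketch below builds one ordering constraint for every branch other than the most noise-sensitive one in each fan-out. The data layout (a dict of per-branch sensitivities) and the name `fanout_ordering_constraints` are assumptions made for this example, not part of the original formulation.

```python
def fanout_ordering_constraints(fanouts):
    """Return (shorter, longer) branch pairs meaning wl[shorter] <= wl[longer].

    fanouts: dict mapping a multiple fan-out signal to a dict of
             branch name -> sensitivity of the system output to noise
             injected into that branch (hypothetical values).
    """
    constraints = []
    for branches in fanouts.values():
        # The most sensitive branch is constrained to be the longest.
        longest = max(branches, key=branches.get)
        constraints.extend((b, longest) for b in branches if b != longest)
    return constraints


fanouts = {
    "sig_a": {"a1": 0.9, "a2": 0.1, "a3": 0.4},  # 3 branches
    "sig_b": {"b1": 0.2, "b2": 0.7},             # 2 branches
}
constraints = fanout_ordering_constraints(fanouts)
print(constraints)
```

Here m = 2 signals and b = 5 branches, so 5 − 2 = 3 constraints are produced, each of which is a linear inequality between two word-length variables.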
Whilst the optimum ordering of branches can be determined for the simplified
model, in contrast for the noise model in [CCL02] a signal with n fan-outs has n! possible
orderings of those fan-out branches, with each possible ordering giving a different noise
response to the truncations occurring in the taps between branches. Only one of the
orderings of fan-out branches will yield the global optimum however. Clearly, for any
circuit that contains a signal that fans-out to more than a few components the number of
possible branch orderings becomes very computationally expensive to explore.
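The growth in the number of orderings can be made concrete with a short calculation. The helper below is purely illustrative: it compares the n! candidate orderings of a single n-branch fan-out under the model of [CCL02] with the n − 1 ordering constraints needed for the same signal under the simplified model.

```python
import math


def ordering_counts(n_branches):
    """Return (orderings to explore, constraints in the simplified model)."""
    return math.factorial(n_branches), n_branches - 1


for n in (2, 3, 5, 8, 10):
    orderings, constraints = ordering_counts(n)
    print(n, orderings, constraints)
```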
In conclusion, the proposed noise model will give less accurate estimates of
truncation noise in high fan-out signals than the model in [CCL02] but allows the optimum
configuration of word-lengths for multiple fan-out signals to be found under the proposed
noise injection scheme.
6.2.3 Preventing zero-padding of signals
Where a signal is truncated from its original word-length $n_o$ to the word-length $n_t$, the
resulting noise variance $\sigma$ at the system output is calculated using (6.2), where $s$ is the
system output’s sensitivity to noise injected into the signal and $p$ is the signal’s scaling.
\[
\sigma =
\begin{cases}
\frac{1}{12}\, s^2\, 2^{2p} \left( 2^{-2 n_t} - 2^{-2 n_o} \right) & \text{when } n_o > n_t \\
0 & \text{when } n_o \le n_t
\end{cases}
\tag{6.2}
\]
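A direct transcription of (6.2) is sketched below. The function name and argument order are assumptions made for illustration; the arithmetic follows the equation term by term.

```python
def truncation_noise_variance(n_o, n_t, s=1.0, p=0):
    """Output noise variance from truncating a signal, following (6.2).

    n_o: original word-length, n_t: truncated word-length,
    s: sensitivity of the system output to noise in this signal,
    p: scaling of the signal.
    """
    if n_o <= n_t:
        return 0.0  # nothing is truncated, so no noise is injected
    return (s ** 2) * 2.0 ** (2 * p) * (2.0 ** (-2 * n_t) - 2.0 ** (-2 * n_o)) / 12.0


# Removing one further bit roughly quadruples the noise when n_o is much
# larger than n_t, because the 2**(-2*n_o) term is then negligible.
ratio = truncation_noise_variance(60, 9) / truncation_noise_variance(60, 10)
print(ratio)
```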
Note that if $n_t$ becomes larger than $n_o$ then the signal is no longer being truncated
but is instead having zero-valued bits appended to its LSB. These zero-valued
bits do not represent any useful information and no noise is introduced when they are
later truncated. As a result, truncating a signal that is computed from one or more signals that
contain zero-padded bits will cause less noise to be propagated to the system output than
is estimated by (6.2), as $n_o$ does not discriminate between bits that contain information
and zero-padded bits, which do not. Rather than attempting to account for situations
where this arises it is more prudent to constrain the word-length selections that can be
made so that zero-valued bits are never appended to any signal in the system.
The constraints required to prevent zero-padding of signals are easily formulated
for each type of component, as shown in (6.3)-(6.6). In (6.3)-(6.6) $p_n$ and $wl_n$ represent the
scaling and word-length of the nth input signal to the component, and $p_{name}$ represents
the scaling of the output signal of the component called name. These constraints are
summarised as follows. The output word-length $wl_{add}$ of an adder must have no more LSB
bits than the input signal that has the most LSB bits (6.3). The output word-length $wl_{mult}$
of a multiplier must have no more bits (after scaling) than the sum of the word-lengths of
its two inputs (6.4). The output word-length $wl_{input}$ of an input to the system must
not exceed a user chosen limit $I_{wl}$ (6.5). Finally the word-length $wl_i$ of each branch $i$
of a multiple fan-out signal with a set of $B$ branches must be no longer than the original
word-length $wl_{fo}$ of the signal (6.6).
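\begin{align}
wl_{add} - p_{add} &\le \max(wl_1, wl_2) - \max(p_1, p_2) \tag{6.3} \\
wl_{mult} &\le wl_1 + wl_2 - p_1 - p_2 + p_{mult} \tag{6.4} \\
wl_{input} &\le I_{wl} \tag{6.5} \\
wl_i &\le wl_{fo} \quad \forall i \in B \tag{6.6}
\end{align}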
These inequality constraints are all linear except for (6.3), which uses one max
operator for $\max(wl_1, wl_2)$ ($\max(p_1, p_2)$ can be determined before optimisation as the
scalings $p_1$, $p_2$ are fixed values). Unfortunately this max operator makes the constraint
in (6.3) non-convex. This non-convexity can be avoided by constraining the word-length
of one input to an adder to be larger than the word-length of the adder’s other input,
allowing (6.3) to be simplified to:
\[
wl_{add} - p_{add} \le wl_l - \max(p_1, p_2) \tag{6.7}
\]
where $wl_l$ is the word-length of the adder input that is constrained to be larger than the
other adder input. Which of an adder’s two inputs should be the longer is easily decided:
the output of an algorithm is equally sensitive to truncation noise at either input, so the
input with the greater effect on power consumption should be truncated most to achieve
minimum power, and the other input is constrained to be the longer one.
One maximum word-length constraint is required for each word-length in a system
(and each adder in a system requires an additional constraint to ensure one of its input
word-lengths is larger than the other), and is included as one of the inequality constraints
in h(x) ≤ 0 from (6.1).
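The zero-padding constraints can be written in the h(x) ≤ 0 form expected by the optimiser, as sketched below. The function names and argument lists are assumptions made for illustration; each function returns the left-hand side minus the right-hand side of the corresponding inequality, so a non-positive value means the constraint is satisfied.

```python
def adder_h(wl_add, p_add, wl_longer, p1, p2):
    """(6.7): wl_add - p_add <= wl_l - max(p1, p2), with wl_l the longer input."""
    return (wl_add - p_add) - (wl_longer - max(p1, p2))


def adder_input_order_h(wl_shorter, wl_longer):
    """Extra adder constraint keeping one input no longer than the other."""
    return wl_shorter - wl_longer


def multiplier_h(wl_mult, p_mult, wl1, wl2, p1, p2):
    """(6.4): wl_mult <= wl1 + wl2 - p1 - p2 + p_mult."""
    return wl_mult - (wl1 + wl2 - p1 - p2 + p_mult)


def input_h(wl_input, limit):
    """(6.5): wl_input <= I_wl, a user chosen limit."""
    return wl_input - limit


def fanout_h(branch_wls, wl_fo):
    """(6.6): one constraint per fan-out branch, wl_i <= wl_fo."""
    return [wl_i - wl_fo for wl_i in branch_wls]


print(adder_h(10, 0, 12, 0, 0), multiplier_h(20, 0, 8, 8, 0, 0))
```

All of these expressions are linear in the word-length variables, which is what allows them to be included alongside the noise constraint without introducing non-convexity.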
Using these and the constraints described in preceding sections the global minimum
of non integer word-length optimisation problems for minimising the area or sum of word-
lengths in a system can be found. Finding the global minimum of word-length optimisation
problems for minimising power consumption is more complex and is approached in the
following sub-section.
6.2.4 Inter-component routing power convexity
As described in Chapter 5, the power consumed in the inter-component routing wires in
a system is estimated by:
• using simple estimates of net fan-out and a linear program to estimate the placement
of components in order to calculate routing wire capacitance values, and,
• using word-level signal statistics to estimate the activity in each signal.
The wire-length values extracted from our method for estimating the placement of
a system are not necessarily convex in relation to word-length, as changing the word-length
of a signal will affect the size of the component driven by that signal, which may in turn cause
the linear program that estimates component placement to find a significantly different
solution to the placement problem. This may result in a discontinuity in wire-length values
as word-lengths change, causing non-convexity.
A second problem with the linear programming formulation for estimating the
wire-length of the inter-component routing wires in a circuit is that, although it is many orders of
magnitude faster than true placement, it is still slow compared to the rest of the models
used to estimate power consumption (of the order of 0.5 seconds per placement of a system
as shown in Chapter 5). Additionally, partial derivatives of the power models with respect
to each word-length in a system are required by SQP at each iteration of the optimisation.
Whilst for all other parts of the power model it is possible to derive expressions for the
derivatives with respect to word-length, the only option available for the wire-length values
is to approximate their derivatives by using finite differences, due to the LP formulation.
This means that for a system with n word-lengths, n + 1 evaluations of the LP
formulation for placement estimation are required to calculate the derivatives of wire-
length with respect to word-length at each iteration of SQP. This becomes computationally
expensive for larger systems.
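The n + 1 evaluation count follows directly from a forward-difference scheme, as the sketch below shows. The step size `h`, the function names, and the linear stand-in for the LP-based wire-length estimate are all assumptions made for illustration.

```python
def forward_difference_gradient(f, x, h=1e-3):
    """Approximate the gradient of f at x using one base and n shifted evaluations."""
    f0 = f(x)  # 1 evaluation at the current point
    gradient = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        gradient.append((f(shifted) - f0) / h)  # 1 evaluation per word-length
    return gradient


class CountingModel:
    """Hypothetical stand-in for the placement estimate; counts its own calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, word_lengths):
        self.calls += 1
        return sum(0.5 * wl for wl in word_lengths)


model = CountingModel()
grad = forward_difference_gradient(model, [12.0, 14.0, 16.0])
print(grad, model.calls)
```

With three word-lengths the model is evaluated four times. At roughly 0.5 seconds per placement estimate, a system with 100 word-lengths would spend on the order of 50 seconds per SQP iteration on these derivatives alone.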
In order to circumvent these two problems, power consumption optimisation is run
in two phases, where each phase involves the solution of an optimisation problem using
SQP. During the first phase only the convex functions that affect power consumption
are optimised while an initial estimate of the non-convex wire-length values is used that
remains constant throughout this first phase. Even though wire-length values do not
change during the first phase, logic power is still optimised, as is routing power, based on
changing activity and fan-out values and the constant wire-length estimates. By taking
this approach it is possible to find an optimal solution to the power consumption estimated
by the convex factors besides wire-length.
Once this optimal solution to the first phase has been found, it is used as the starting point
for the second phase, in which new wire-length estimates are used at each SQP
iteration in conjunction with the other information provided by the power model to estimate
power consumption. It is hoped that the second phase will terminate quickly despite each
iteration of SQP taking longer than during the first phase, as the solution to the first
phase should lie close to the solution of the second phase despite the effects of changing
wire-length not being accounted for during the first phase.
An important point to note is that the closer the initial wire-length estimates
used during the first phase of power consumption optimisation are to those that would be
given by the minimal power consumption solution, the better the chance of finding the global
solution. Obviously the wire-length values at the power optimal solution are unknown
until after optimisation, and instead an estimate of these must be used during the first
phase of power consumption optimisation. The wire-length values at the area-optimal
point are used for this purpose, as they will hopefully be close to those found at the
power-optimal point.
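The structure of the two-phase procedure can be sketched on a one-variable toy problem. Everything in the block below is an assumption made for illustration: `local_search` is a simple descent routine standing in for SQP, `wire_length` stands in for the LP placement estimate, and `power` is an invented model in which the noise constraint is replaced by a penalty term.

```python
def local_search(f, x0, lo=1.0, hi=32.0, step=1.0, tol=1e-6):
    """Toy 1-D descent standing in for SQP; accepts only strict improvements."""
    x, fx = x0, f(x0)
    while step > tol:
        improved = False
        for candidate in (x - step, x + step):
            if lo <= candidate <= hi:
                fc = f(candidate)
                if fc < fx:
                    x, fx, improved = candidate, fc, True
                    break
        if not improved:
            step /= 2.0
    return x


def wire_length(wl):
    # Hypothetical stand-in for the LP-based wire-length estimate.
    return 1.0 + 0.1 * wl


def power(wl, wire):
    # Hypothetical model: logic power + routing power + noise penalty.
    return wl + wire * wl + 1e3 * 4.0 ** (8 - wl)


def two_phase(x_area_opt, x0):
    frozen = wire_length(x_area_opt)  # phase 1: wire-lengths held constant
    x1 = local_search(lambda wl: power(wl, frozen), x0)
    # Phase 2: wire-lengths re-estimated at every evaluation, starting from x1.
    x2 = local_search(lambda wl: power(wl, wire_length(wl)), x1)
    return x1, x2


def full_power(wl):
    return power(wl, wire_length(wl))


x1, x2 = two_phase(x_area_opt=12.0, x0=16.0)
print(x1, x2, full_power(x1), full_power(x2))
```

Because the second phase starts from the first-phase solution and only accepts improvements, its result can be no worse under the full model than the first-phase point.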
In practice this method of optimising power consumption works well and is able
to find minima in the power consumption cost function that cannot be improved upon
despite extensive experimentation, i.e. it would seem that the global minimum is found.
6.2.5 Summary
The preceding sections have described solutions to issues relating to convexity and the
correct calculation of algorithm accuracy that allow the optimal solution of non integer
word-length optimisation problems to be found. Previous work in the field of word-length
optimisation that was summarised in Section 2.1.4 has not accounted for these issues
(except for the correct modelling of correlated truncation noise in the NP-hard Mixed
Integer Linear Programming word-length selection technique in [CCL02]). The calculation
of lower and upper bounds on the cost of a word-length optimisation problem has also
not been considered in previous work, and neither has the minimisation of the power
consumption of algorithm implementations. Hence the word-length optimisation method
presented in this section represents significant new work in the field.
Work in the following section studies results from the technique described above
that enhance present knowledge on the improvements obtainable from word-length
optimisation.
6.3 Results
This section presents results obtained from running the word-length optimisation
techniques described in the preceding sections on a set of benchmark circuits in order to
establish the range of advantages that can be provided by minimising different word-length
optimisation cost functions.
The objectives of the word-length optimisation procedures compared in the results
which follow are listed below. SQP is used to find non integer solutions to each of these
word-length optimisation problems. Note that all signal scalings and noise sensitivities
are established once only by perturbation analysis [Con03] (which was described in the
background to this thesis in Section 2.1.2), before any optimisations are executed.
Minimise uniform word-length, using individual scalings. The same word-length is
used for each signal, but each signal can use a different scaling.
Minimise sum of word-lengths. Find the solution which gives the minimum sum of
the word-lengths of signals in a system.
Minimise area. Find the solution which gives the minimum system area as estimated
by the area models described in Chapter 3.
Minimise power consumption. Find the solution which gives the minimum system
dynamic power consumption as estimated by the power models described in
Chapters 4 and 5.
Table 6.1 shows the benchmark circuits used in the results presented in this section.
Note that in Figures 6.2 and 6.3 and Figures 6.5-6.7, which follow, the ‘short names’ of test
systems from Table 6.1 are used in order for these to be displayed in the space available.
In order to give an indication of the complexity of these circuits, Table 6.1 also shows
the estimated area of each circuit in SLICEs and the circuit’s estimated dynamic power
consumption when using the minimum non integer uniform word-lengths that achieve an
output noise variance of $10^{-3}$. Note that these dynamic power consumption values were
obtained at a somewhat modest clock frequency (30 MHz) from XPower estimates.
The circuit types listed in Table 6.1 are explained below.
FIR. Direct form transposed Finite Impulse Response filter.
PE. Polynomial Evaluator, polynomial order given by the ‘order’ column in Table 6.1.
LMS. Least Mean Squares adaptive filter.
IIR. Direct form II transposed Infinite Impulse Response filter.
IIR SOS. Direct form II transposed Infinite Impulse Response filter organised as
Second Order Sections, the ‘order’ column in Table 6.1 indicates the number of
second order sections in the filter.
Note that all circuits are pipelined to the extent that the output of every arithmetic
component in the circuit is registered. The IIR, IIR SOS and LMS circuits must be
implemented as multi-channel filters to allow this level of pipelining.
As mentioned in Chapter 3 the area and dynamic power consumption models used in
this work have been characterised for the Virtex 2 Pro family of Xilinx FPGAs. All results
presented here are for this chip, though the models used could easily be characterised for
other FPGAs.
The results listed below are presented in the following sections.
• The improvements offered by area optimisation and power consumption optimisation
over the other optimisation methods, when non integer word-length optimisation is
performed.
• The lower and upper bounds on area and power consumption for the integer
word-length optimisation problems that are found by the proposed method, and the best
values for these costs obtained by using a heuristic to improve on the upper bound.
• The run times of the non integer word-length optimisation procedures.
• The improvements that may be obtained when some components do not have
registered outputs and as a result systems contain signals with much higher activity (due
to glitches).
Table 6.1: Benchmark Circuits
Short Name  Type                    Order  Area (SLICEs)  Power (mW)
F7l         FIR, low-pass           7      100            1.4
F7h         FIR, high-pass          7      100            1.6
F15l        FIR, low-pass           15     194            2.8
F15h        FIR, high-pass          15     190            3.0
F31l        FIR, low-pass           31     383            5.5
F31h        FIR, high-pass          31     373            5.9
PE          Polynomial evaluator    7      162            2.9
L2          Adaptive LMS            2      181            4.1
L4          Adaptive LMS            4      538            12.0
L8          Adaptive LMS            8      1152           25.0
I2l         IIR ladder, low-pass    2      158            3.6
I2h         IIR ladder, high-pass   2      139            3.0
I3l         IIR ladder, low-pass    3      239            5.8
I3h         IIR ladder, high-pass   3      196            4.3
S2l         IIR, SOS, low-pass      2      243            5.8
S2h         IIR, SOS, high-pass     2      183            3.7
S3l         IIR, SOS, low-pass      3      364            8.7
S3h         IIR, SOS, high-pass     3      295            6.2
6.3.1 Optimal non integer improvements in area and power consumption
In the following subsections the improvements in area and power consumption offered by
non integer word-length optimisation for area and power respectively will be presented.
Improvements for each of the systems in Table 6.1 will be shown. We will compare the area
and power word-length optimisations to uniform word-length optimisation, to minimum sum
of word-lengths optimisation, and to each other, in the three subsections that follow.
All circuits were optimised to achieve an output noise variance (due to truncation
within the circuit) of $10^{-3}$ and it is assumed that only embedded multipliers are used for
multiplication.
Area and power improvements over uniform word-length optimisation
Figure 6.2 shows the improvements over uniform word-length optimisation offered by
(a) area optimisation, in terms of area, and (b) power consumption optimisation, in terms
of power consumption.
Let us first examine the trends in the results in Figure 6.2(a), where the increase in
area when using the optimum uniform word-lengths instead of the minimum area
word-lengths is shown. Improvements in area of up to 40% over uniform
word-length optimisation can be seen, with the largest differences in area between area
optimisation and uniform word-length optimisation obtained for the LMS adaptive filter
systems and for the low-pass IIR and IIR SOS systems.
The main reason for the large difference achieved for these systems is the large
differences in the sensitivities of the outputs of these circuits to truncation noise introduced
to different signals in the circuits. In the LMS adaptive filters for example, truncation
at the inputs of the multipliers and accumulators used for the adjustment of coefficient
values causes noise at the output of the system that is several orders of magnitude larger
than that caused by truncation elsewhere in the circuit.
[Figure 6.2 graphic: horizontal bar charts with one bar per test system of Table 6.1 (F7l to S3h). x-axes, both 0 to 40: (a) Area improvement over Uniform WLs (%), (b) Power improvement over Uniform WLs (%).]
Figure 6.2: (a) The improvements in area when using the minimum area word-lengths
instead of the optimum uniform word-lengths. (b) The improvements in dynamic
power consumption when using the minimum power word-lengths instead of the
optimum uniform word-lengths.
By its nature uniform word-length optimisation cannot choose to perform less
truncation for these signals and thus can only make a small number of word-length reductions
before noise constraints are broken. Multiple word-length optimisation however can choose
to only slightly truncate very noise-sensitive signals whilst achieving improvements by
using more truncation for less noise-sensitive signals.
Note that the high-pass versions of the IIR and IIR SOS filters show a smaller range
of sensitivities than the low-pass versions, and so a smaller area advantage over uniform
word-length optimisation is achieved than for their low-pass counterparts. The difference
in the range of sensitivities of signals between the low-pass and high-pass versions of these
circuits is due to the quantised versions of the chosen low-pass filters being inherently less
stable than the high-pass filter versions. The FIR circuits also show only small differences
between uniform word-length and area optimisation, as once again the range of signal
sensitivities in these circuits is small.
Let us now consider the increase in dynamic power consumption when using the
optimum uniform word-lengths instead of the minimum power word-lengths, as shown in
Figure 6.2(b).
We can see that power consumption optimisation also shows significant benefits
over uniform word-length optimisation, and in general larger improvements are possible
for power over uniform word-lengths than for area over uniform word-lengths, particularly
for the FIR and PE systems.
The differences between uniform word-length optimisation and power consumption
minimisation are due to the same reasons cited for area optimisation. However the larger
power improvements seen for the FIR and PE systems in particular are due to the less
noise-sensitive signals in these circuits having a greater effect on power consumption than
on area. These signals are the inputs to the embedded multipliers in the case of these
circuits, where during area optimisation only one embedded multiplier per multiplication
is required, and hence the area cost of multiplication appears fixed in terms of multiplier
input word-length.
The power consumed by the input signals to embedded multipliers is still affected
by word-length however due to its effect on the number of wires in a signal and the
total activity in the signal. Thus power consumption optimisation can achieve greater
improvements over uniform word-length optimisation as it can attain improvements in
multipliers as well as other component types.
Improvements compared to minimum sum of word-lengths optimisation
Figure 6.3 shows in (a) the improvements in area when using the minimum area
word-lengths instead of the minimum sum of word-lengths, and in (b) the improvements
in dynamic power consumption when using the minimum power word-lengths
instead of the minimum sum of word-lengths.
In both cases we see that in general improvements made over the optimisation of the
minimum sum of word-lengths are small. An average of 2.26% improvement in area and
2.32% in power consumption for area optimisation and power optimisation, respectively,
is achieved across the test systems studied.
[Figure 6.3 graphic: horizontal bar charts with one bar per test system of Table 6.1 (F7l to S3h). x-axes, both 0 to 10: (a) Area improvement over Min. Sum WLs (%), (b) Power improvement over Min. Sum WLs (%).]
Figure 6.3: (a) The improvements in area when using the minimum area word-lengths
instead of the minimum sum of word-lengths. (b) The improvements in
dynamic power consumption when using the minimum power word-lengths
instead of the minimum sum of word-lengths.
It is not immediately obvious why this would be the case, but understanding this
small difference between what would seem to be significantly different optimisations takes
us to the heart of the trade-offs available in word-length optimisation and as such this
result will now be discussed in detail.
Before we discuss the similarities or otherwise of the three types of word-length
optimisation in question, let us first consider an example that demonstrates the range
of results possible for word-length optimisation problems with identical noise constraints
but different cost functions. Figure 6.4 shows the optimum cost points of several simple
word-length optimisation problems. The word-length optimisation problems are identical
except for their cost functions: they each contain two signals whose un-truncated
word-lengths are 16 bits, and each problem is optimised to achieve a total noise variance of
$\frac{1}{12}\left(2^{-2 \times 14} - 2 \times 2^{-2 \times 16}\right)$. Let us assume that each signal is equally sensitive to truncation
and has a scaling of zero, so that using (6.2) to calculate the noise contributed by each
signal we can establish the values of the two word-lengths $wl_x$ and $wl_y$ that meet the above
noise variance constraint, as shown by (6.8).
\begin{align}
\frac{1}{12}\left(2^{-2 wl_x} - 2^{-2 \times 16} + 2^{-2 wl_y} - 2^{-2 \times 16}\right)
  &= \frac{1}{12}\left(2^{-2 \times 14} - 2 \times 2^{-2 \times 16}\right) \notag \\
\text{i.e.} \quad 4^{-wl_x} + 4^{-wl_y} &= 4^{-14} \tag{6.8}
\end{align}
[Figure 6.4 graphic: four plots of system cost (y-axis) against $wl_x$ (x-axis, 14 to 16), with system cost (a) $wl_x + wl_y$, (b) $2\,wl_x + wl_y$, (c) $4\,wl_x + wl_y$ and (d) $8\,wl_x + wl_y$. Optimum points as annotated: (a) (14.5, 14.5), (b) (14.3, 14.8), (c) (14.2, 15.2), (d) (14.1, 15.8).]
Figure 6.4: The optimum cost points of several 2-signal systems with word-lengths
$wl_x$ and $wl_y$. Each plot shows the cost of the system (y-axis) as $wl_x$ (x-axis) and $wl_y$
(not shown) are changed whilst meeting a fixed noise constraint. Each system uses a
different cost function as shown below each plot. The optimum point for system (a)
is marked by a circle, and the same point is also marked by a circle in plots (b)-(d).
The crosses in plots (b)-(d) mark the optimum points for each of those plots. The
values of $wl_x$ and $wl_y$ at each optimum point are shown in brackets.
The range of real values for wlx and wly obtained by solving (6.8) is plotted in
Figure 6.4 against four different cost functions (a)-(d), and the optimum point of each
resulting word-length optimisation problem is shown. Of particular interest here is the
change in the value of the cost function at the optimum point as we gradually change the
cost function from (a) to (d). Note that the optimum point for problem
(a) is indicated by a circle and is also shown in the subsequent problems (b)-(d), whose
optimum points are represented by crosses. If the cost associated with the signal whose
word-length is wlx is doubled (as occurs if we move from cost function (a) to (b)), then
the optimum point from problem (a) under the new cost function (b) has a cost that is
only 0.28% greater than the cost of the optimal solution to problem (b). Similarly if the
cost of the signal whose word-length is wlx is multiplied by 4 (i.e. move from (a) to (c))
or multiplied by 8 (i.e. move from (a) to (d)), then the cost associated with the original
optimum point (a) under the new cost functions is only 0.97% and 1.74% greater than the
optimum points of (c) and (d) respectively.
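The percentages above can be reproduced with a short calculation. The following is a minimal sketch, not the optimisation procedure used in this work: it assumes the constraint in the simplified form of (6.8) and uses the closed-form stationary point of the Lagrangian, at which the noise budget is shared between the two signals in proportion to their cost weights. The function names are illustrative only.

```python
import math

NOISE_BUDGET = 4.0 ** -14  # right-hand side of (6.8): 4^-wlx + 4^-wly = 4^-14


def optimum(cx, cy=1.0):
    """Minimise cx*wlx + cy*wly subject to 4^-wlx + 4^-wly = 4^-14.

    Setting the gradient of the Lagrangian to zero gives
    4^-wlx / 4^-wly = cx / cy, so the costlier signal is given the
    larger share of the noise budget (and the shorter word-length).
    """
    share_x = cx / (cx + cy)
    share_y = cy / (cx + cy)
    wlx = -math.log(share_x * NOISE_BUDGET, 4)
    wly = -math.log(share_y * NOISE_BUDGET, 4)
    return wlx, wly


def cost(cx, wl, cy=1.0):
    """Linear system cost cx*wlx + cy*wly."""
    return cx * wl[0] + cy * wl[1]


def excess_percent(cx):
    """Extra cost (%) of re-using the optimum of (a) under cost weight cx."""
    point_a = optimum(1.0)
    return 100.0 * (cost(cx, point_a) / cost(cx, optimum(cx)) - 1.0)


for label, cx in (("a", 1.0), ("b", 2.0), ("c", 4.0), ("d", 8.0)):
    wlx, wly = optimum(cx)
    print(f"({label}) wlx={wlx:.2f} wly={wly:.2f} "
          f"excess of point (a): {excess_percent(cx):.2f}%")
```

Under these assumptions the excess costs for (b), (c) and (d) evaluate to 0.28%, 0.97% and 1.74%, in agreement with the values quoted above.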
Figure 6.4 illustrates a key aspect of multiple word-length optimisation and serves
to explain why solutions to multiple word-length optimisations with different cost func-
tions such as area, power, and minimum sum of word-lengths, have similar cost values
when we move from one cost function to another. Whilst these cost functions are either
linear or quadratic in terms of word-length (discounting non-convexities due to embedded
multipliers and routing wire-lengths), the noise function is exponentially related to word-
length. As a result only small changes in word-lengths can be made without violating
noise constraints.
At the minimum sum of word-lengths point it may seem that significant area or
power consumption improvements can be made by decreasing the word-length of a high-cost
signal. However, doing so will cause the truncation noise associated with that signal
to increase exponentially. Every 1-bit decrease in the word-length of a signal causes the
truncation noise associated with that signal to quadruple, which may require the
word-lengths of several other signals to be increased in order for noise constraints to
continue to be met. As increasing or decreasing word-lengths results in only percentage
changes in the linear or quadratic cost function, percentage differences between multiple
word-length optimisation schemes with cost functions of a similar order should be expected.
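The quadrupling can be checked directly from the per-signal noise term that appears in (6.8). As a sketch, assume that the noise of one signal with an un-truncated word-length of 16 bits, equal sensitivity and zero scaling has the form below (the symbol \sigma^2 is introduced here for illustration only):

```latex
\sigma^2(wl) = \frac{1}{12}\left(2^{-2\,wl} - 2^{-2\times 16}\right)

\sigma^2(wl-1) = \frac{1}{12}\left(4\cdot 2^{-2\,wl} - 2^{-2\times 16}\right)
               = 4\,\sigma^2(wl) + 2^{-34}
```

Removing one bit therefore multiplies the noise of that signal by at least four, whereas a linear cost term c \cdot wl falls by only c.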
Having explained the reasons for the similarity between the results of the different
multiple word-length optimisation schemes above, let us now examine the differences
between them that allow small improvements to be made.
In the case of area optimisation, one embedded multiplier has been used for each
multiplication in the test systems (i.e. no multipliers greater than 18×18 bits are required
to satisfy the noise constraints used for these results). As a result, changes in word-length
do not affect the area of the multipliers in the circuits. The area cost of adders however
is linearly related to word-length (see Chapter 3), so slight differences between the cost of
adders and delays, and the fact that embedded multipliers appear to have no effect on area,
cause slight differences between the area achieved by area optimisation and minimum sum
of word-lengths optimisation.
Differences between the power consumption associated with the word-lengths of a
system that can cause power optimisation to reach different optimal results from minimum
sum of word-length optimisation arise from the following factors: i) the input signal
activity of components, ii) differences in the power consumed by different component
types, and iii) different capacitances of inter-component routing wires. Let us now
examine each of these factors in more detail.
i) The input signal activity of components. The power consumption of 16-bit input
adders with high activity (signal lag-1 autocorrelation ρ of −0.98) inputs is 1.48 times
larger than those with low activity (ρ of +0.98), and for 18-bit input embedded multipliers
the difference is 1.76 times. In practice however, the test circuits used in this work do not
exhibit much variation in lag-1 autocorrelation between signals, and so such extremes of
power consumption variation within components of the same type have not been observed.
Additionally, even if circuits with such large extremes of signal activity had been tested,
Figure 6.4 shows that differences in component power of around a factor of 2 are
unlikely to achieve significant improvements.
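To make the activity measure concrete, the following sketch computes the sample lag-1 autocorrelation ρ of two illustrative sequences. The estimator and both sequences are assumptions chosen for illustration; they are not the test vectors used in this work. A sequence whose consecutive samples tend to have opposite signs gives ρ close to −1 (high activity), whilst a slowly varying sequence gives ρ close to +1 (low activity).

```python
import math


def lag1_autocorrelation(samples):
    """Sample lag-1 autocorrelation coefficient of a sequence."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]
    # Products of each centred sample with its successor.
    numerator = sum(a * b for a, b in zip(centred, centred[1:]))
    denominator = sum(c * c for c in centred)
    return numerator / denominator


# Sign alternates on every sample: consecutive values are anti-correlated.
alternating = [(-1.0) ** t for t in range(1000)]
# Slowly varying sinusoid: consecutive values are nearly equal.
slow = [math.sin(2.0 * math.pi * t / 500.0) for t in range(1000)]

rho_high_activity = lag1_autocorrelation(alternating)
rho_low_activity = lag1_autocorrelation(slow)
print(rho_high_activity, rho_low_activity)
```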
ii) Differences in the power consumed by different component types. Adders and
embedded multipliers show power consumption values of a similar order on the Virtex II
Pro. For example, a multiplier with 18-bit inputs has a power consumption only 1.36 times
larger than an adder with 18-bit inputs when they are both driven by signals that are
uncorrelated.
iii) Different capacitances of inter-component routing wires. The capacitance of
an inter-component routing wire is determined by the number, size and placement of the
components connected by that wire. The circuits studied have topologies such that
most signals connect one output of a component to the input of another, whilst a small
number of signals connect multiple components together. Each circuit studied has at least
one signal with several fan-outs, and it is these signals which are the cause of the greatest
differences between power consumption optimisation and minimum sum of word-length
optimisation, as a high fan-out wire’s power consumption may be at least an order of
magnitude larger than that of other wires in a circuit.
[Figure 6.5: two horizontal bar charts (x-axis 0 to 20%) over the test systems F7l, F7h,
F15l, F15h, F31l, F31h, PE, L2, L4, L8, I2l, I2h, I3l, I3h, S2l, S2h, S3l and S3h.
(a) Area improvement (%) over area of Min. Power Opt. (b) Power improvement (%) over the
power of Min. Area Opt.]
Figure 6.5: (a) The improvements in area when using the minimum area word-lengths
instead of the minimum power word-lengths. (b) The improvements in dynamic power
consumption when using the minimum power word-lengths instead of the minimum
area word-lengths.
Comparison of minimum area and minimum power optimisations
The results in Figure 6.5 show the difference between the area of power optimal circuits
and area optimal circuits in (a), and the difference between the power of area optimal
circuits and power optimal circuits in (b).
There is a significant difference between the two sets of results. Area optimisation
achieves an improvement of only 0.52% on average over the area of power optimisation.
Power consumption optimisation achieves a range of improvements between 1% and 15%
over the power consumption of area-optimal circuits.
First, let us discuss the larger improvements in power consumption achieved for
the FIR, PE and LMS circuits. The advantages here arise due to the fact that area opti-
misation does not heavily truncate the inputs to embedded multipliers in these systems,
as only one embedded multiplier is required per multiplication in order to achieve noise
constraints. As a result the power consumption of the signals driving these multipliers is
high at the area optimal point.
Power consumption optimisation can significantly reduce the power of these signals
over area optimisation, despite the exponential relation between noise and word-length,
because these signals have not been heavily truncated and so can be reduced substantially
before they begin to significantly affect noise constraints. Area optimisation has heavily
truncated the inputs to adders in these circuits, but these consume less power than the
large word-length nets that drive the multipliers in these circuits. Power consumption op-
timisation achieves lower power by increasing the word-lengths of the less power-intensive
adders whilst making gains by reducing the word-lengths of the power-hungry nets that
drive multipliers.
This is an important phenomenon: if a change in cost function results in some
components affecting cost that had little or no effect at all under the previous cost function,
then significant differences between the resulting optimisations can arise.
It is important to highlight the difference between this phenomenon and that seen
earlier in the optimisation of the minimum sum of word-lengths. When optimising the
minimum sum of word-lengths every word-length has the same effect on cost. Optimisation
of the minimum sum of word-lengths will hence truncate every signal as much as is possible
within the given noise constraints. Other multiple word-length optimisation problems may
not be able to achieve more than small percentage improvements over optimisation of the
minimum sum of word-lengths as each signal has already been truncated significantly and
because noise constraints are exponentially related to word-length.
In contrast if one optimisation sees little or no need to truncate some signals, but
another does, significant differences in terms of the cost function of the second optimisation
can be achieved. This sheds additional light on why uniform word-length optimisation is
very poor at achieving good results in terms of area and power consumption: because
a small number of very noise-sensitive signals quickly end up breaking noise constraints
whilst a larger number of less noise-sensitive signals are not truncated much and so cause
high costs.
The area consumption of power optimal systems is very close to that obtained by
area optimisation for the same reason that the minimum sum of word-lengths optimisation
is close to area optimisation and to power optimisation: like the minimum
sum of word-lengths optimisation, power consumption optimisation will truncate all
word-lengths to some extent. As a result, the exponential relation between noise constraints and
word-length prevents significant differences between the two optimisations. Even though
it may be more attractive for power consumption optimisation to truncate some signals
over others, this is not enough to make the area gap between it and area optimisation very
large.
There is a final trend in the results in Figure 6.5 that has not yet been explained:
why do the IIR and IIR SOS systems not show significant power consumption improve-
ments over the power achieved by area optimisation? Whilst power consumption optimi-
sation once again gains an advantage over area optimisation by more heavily truncating
the inputs to multipliers in these circuits, these gains are quickly balanced out by the cost
of increasing the word-lengths of the adders in the circuit as these have pipeline registers
in between them to allow three channels of data to be processed. In contrast the FIR, PE
and LMS circuits do not require extra pipeline registers between the adders of the circuit,
and so power consumption optimisation can increase the word-lengths of those adders at
the cost of a small increase in power consumption.
6.3.2 Integer problem lower and upper bounds
Figures 6.6 and 6.7 show the differences in area (Figure 6.6) and power consumption
(Figure 6.7) between the lower and upper-bounds on the costs of the integer versions
of the word-length optimisation problems for area and power consumption minimisation.
The lower bounds are obtained from the optimal non integer word-lengths found for each
circuit, whilst the upper bounds are found by rounding these optimal non integer word-
lengths up to the nearest integers, as described in Section 6.2.1. The figures also show
the gap between the lower-bound on the integer word-length problems and the integer
word-length result found by the word-length optimisation heuristic described in [CCL01]
and summarised in Section 6.2.1.
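The bounding step can be illustrated on the two-signal problem of (6.8). The following is a sketch under simplifying assumptions (linear cost weights, the noise constraint in the form of (6.8), and a closed-form relaxed optimum); it is not the SQP-based procedure used in this work, and the function names are illustrative. Because noise falls monotonically as word-lengths grow, rounding every relaxed word-length up always yields a feasible integer solution, whose cost is an upper bound on the integer optimum.

```python
import math

NOISE_BUDGET = 4.0 ** -14  # constraint of (6.8): sum of 4^-wl_i <= 4^-14


def noise(word_lengths):
    """Total noise in the simplified form of (6.8)."""
    return sum(4.0 ** -w for w in word_lengths)


def cost(weights, word_lengths):
    """Linear cost: weighted sum of word-lengths."""
    return sum(c * w for c, w in zip(weights, word_lengths))


def relaxed_optimum(weights):
    """Non-integer optimum: each signal gets a noise share proportional to its weight."""
    total = sum(weights)
    return [-math.log(NOISE_BUDGET * c / total, 4) for c in weights]


def integer_bounds(weights):
    """Return (lower bound, upper bound, rounded-up integer word-lengths)."""
    relaxed = relaxed_optimum(weights)
    rounded = [math.ceil(w) for w in relaxed]
    if noise(rounded) > NOISE_BUDGET:
        raise ValueError("rounded-up solution violates the noise constraint")
    return cost(weights, relaxed), cost(weights, rounded), rounded


lower, upper, rounded = integer_bounds([2.0, 1.0])
gap_percent = 100.0 * (upper / lower - 1.0)
print(rounded, lower, upper, gap_percent)
```

Any integer solution found by a heuristic must have a cost between these two values to improve on the rounded-up point, which is how the gaps reported in this section are measured.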
The average gap in area between the lower and upper-bounds is 2.9%, whilst the
average gap in area between the lower-bound and the heuristic result is 1.4%. For power
consumption optimisation the average gap between the lower and upper-bounds is 5.1%,
whilst the average gap between the lower-bound and the heuristic result is 2.3%.
These results show that tight lower and upper-bounds on the integer word-length
optimisation problems for area and power optimisation can be achieved using the proposed
method, and that the simple integer word-length optimisation heuristic used can improve
on the integer upper-bound found and obtain cost values close to the lower-bounds found.
[Figure 6.6: bar chart of "Area increase compared to non-integer optimum (%)" (y-axis, 0 to
15) against short system names (F7l, F7h, F15l, F15h, F31l, F31h, PE, L2, L4, L8, I2l, I2h,
I3l, I3h, S2l, S2h, S3l, S3h), with series "Integer upper-bound" and "Heuristic result".]
Figure 6.6: The percentage increase in area of the integer word-length upper-bound
and heuristic solution, relative to the lower-bound given by the minimum area non
integer word-lengths.
[Figure 6.7: bar chart of "Power increase compared to non-integer optimum (%)" (y-axis, 0 to
15) against the same short system names, with series "Integer upper-bound" and "Heuristic
result".]
Figure 6.7: The percentage increase in power consumption of the integer word-length
upper-bound and heuristic solution, relative to the lower-bound given by the minimum
power non integer word-lengths.
[Figure 6.8: bar chart of "Non-integer word-length optimization time (minutes)" (y-axis, 0 to
40) against the same short system names, with series "Uniform WL Opt", "Minimum Sum WLs
Opt", "Area Opt" and "Power Opt".]
Figure 6.8: The time required to run the non integer word-length optimisation procedures.
6.3.3 Run times
Figure 6.8 shows the amount of time taken in minutes to run the non integer word-length
optimisation procedures on a 3GHz Pentium 4 computer. All optimisation procedures and
supporting code are implemented entirely in Matlab, hence significant speed-up of these
results could be achieved with a C implementation, for example.
As expected the simple uniform word-length optimisations are very fast and com-
plete in 0.7 seconds on average. Both the optimisation of the minimum sum of word-lengths
and area optimisation show similar computation times, completing in 57 seconds and 39
seconds on average, respectively. The longest time taken by these two optimisations to
complete is around 5 minutes, for the larger FIR test system.
It is somewhat surprising that area optimisation generally completes in a shorter
time than minimum sum of word-length optimisation, as we would expect the extra
computation required to calculate the area cost of a system to make it take longer. It has
been observed, however, that word-length optimisation for area generally finds its optimal
solution in fewer SQP iterations than minimum sum of word-length optimisation does,
hence its lower computation time on average. The smaller number of SQP iterations for
area optimisation can be explained by the fact that embedded multipliers have a constant
area. The word-lengths that can be used to control the area of these components therefore
need not be changed during the course of area optimisation, which reduces the apparent
dimensionality of the problem and results in fewer iterations before SQP descends to the
optimal point.
Power consumption optimisation shows computation times ranging from
around 30 seconds to almost 40 minutes for the largest systems. Power consumption
optimisation takes significantly longer to execute for two reasons:
• Partial derivative values of power consumption with respect to word-length that
are required by SQP are not currently calculated for logic power, inter-routing wire
activity or fan-outs. Instead these partial derivative values are estimated at each
SQP iteration by using finite differences of the power consumption cost function.
• It is not possible to derive equations to calculate the partial derivative values of
inter-routing wire-lengths with respect to word-length, so these have to be calculated
using finite differences.
We expect that significant speedups could be obtained by calculating the partial
derivative values for logic power, inter-routing wire activity and fan-outs at each SQP
iteration.
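The cost of estimating derivatives by finite differences can be seen in the following sketch. It uses forward differences, which is an assumption (the particular finite-difference scheme is not stated here), and `toy_power` is a hypothetical stand-in for a power model rather than one of the models developed in this work.

```python
def finite_difference_gradient(cost_function, word_lengths, step=1e-3):
    """Forward-difference estimate of d(cost)/d(word-length) for every signal.

    One evaluation is needed at the current point plus one per signal, so a
    system with n word-lengths costs n + 1 model evaluations per gradient.
    """
    base = cost_function(word_lengths)
    gradient = []
    for i in range(len(word_lengths)):
        perturbed = list(word_lengths)
        perturbed[i] += step
        gradient.append((cost_function(perturbed) - base) / step)
    return gradient


def toy_power(word_lengths):
    """Hypothetical power model: a linear term plus a product term."""
    wl_a, wl_b = word_lengths
    return 0.5 * wl_a + 0.02 * wl_a * wl_b


estimated = finite_difference_gradient(toy_power, [14.0, 16.0])
print(estimated)
```

An analytic gradient would replace the n extra model evaluations with a single derivative calculation per iteration, which is the source of the speedup anticipated above.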
6.3.4 Power optimisation improvements in un-pipelined circuits
The power consumption models used in this work assume that the outputs of all arithmetic
components in a circuit are registered so that glitches arising from logic transitions within
these components are not propagated through inter-routing wires or to other components
within a circuit. As was discussed in Section 4.5, work has not been conducted on the
estimation of the power consumed in components whose inputs contain glitches, and it
was shown that doing so accurately is a difficult problem that is left for future research.
Nevertheless this section aims to use simple approximations for the power consumed in
such situations in order to gain insight into the resulting advantages achievable by
word-length optimisation for power consumption minimisation.
[Figure 6.9: block diagram of an un-pipelined path containing the components Register1,
Mult1, Adder1, Mult2 and Register2, together with z^-1 delay elements.]
Figure 6.9: An un-pipelined path in a circuit. The increased power consumption
as a result of glitches passed between components is modelled by doubling the
inter-routing power of Mult1, doubling the internal power of Adder1, quadrupling the
inter-routing power of Adder1, and quadrupling the internal power of Mult2. Register2 is
incorporated into the output of Mult2 and hence the inter-routing power of Mult2 is
not increased.
It may not always be desirable to register the output of every arithmetic component,
particularly for circuits that contain feedback loops that must be implemented as multi-
channel circuits if full pipelining is performed. When fully pipelined the IIR and IIR SOS
filters from Table 6.1 process 3 separate channels of data, whilst the LMS filters process
6, 7 or 8 separate channels, the number of channels increasing with filter order.
In this section we present power consumption minimisation results for single channel,
un-pipelined versions of the IIR, IIR SOS and LMS filters, and the Polynomial Evaluator
circuit. The power consumption P_u in a component whose inputs are not registered
is approximated by multiplying the component’s power P_r when inputs are registered
(estimated by our existing models) by a factor dependent on the number of un-registered
components preceding it:
\[
P_u = P_r \, 2^{D} \tag{6.9}
\]
where D is the number of un-registered components preceding the component. Extra
power due to glitches in the inter-routing wires after the component is estimated by
multiplication by the same factor 2^D. Figure 6.9 depicts an example of how un-pipelined
components are modelled for a simple un-pipelined path.
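The scaling of (6.9) along a path can be sketched as follows. The internal factor is 2^D as in (6.9). For the routing wires this sketch applies the factors listed in the caption of Figure 6.9 (doubled after Mult1, quadrupled after Adder1, unchanged after the registered output of Mult2), which is one reading of how the factor is applied to the wires. The class, the function name and the unit power values are placeholders introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    internal_power: float    # P_r: power with registered inputs
    routing_power: float     # power of the wires driven by this component
    output_registered: bool  # True if a register is built into the output


def unpipelined_power(path):
    """Scale the power of each component on a path fed by a register.

    Internal power follows (6.9): P_u = P_r * 2**D, where D counts the
    un-registered components preceding the component. Wires after an
    un-registered output are scaled by 2**(D + 1), as in the Figure 6.9
    caption; wires after a registered output are left unscaled.
    """
    scaled = {}
    depth = 0  # D for the component currently being visited
    for component in path:
        internal = component.internal_power * 2 ** depth
        if component.output_registered:
            routing = component.routing_power
            depth = 0
        else:
            routing = component.routing_power * 2 ** (depth + 1)
            depth += 1
        scaled[component.name] = (internal, routing)
    return scaled


# The path of Figure 6.9 with placeholder unit powers.
figure_6_9_path = [
    Component("Mult1", 1.0, 1.0, output_registered=False),
    Component("Adder1", 1.0, 1.0, output_registered=False),
    Component("Mult2", 1.0, 1.0, output_registered=True),
]
scaled_powers = unpipelined_power(figure_6_9_path)
print(scaled_powers)
```

With unit powers the returned pairs are the (internal, routing) multipliers themselves, which makes it easy to compare them against the factors in the caption.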
[Figure 6.10: bar chart of "Power improvement given by Power Opt (%)" (y-axis, 0 to 30)
against system names IIR SOS2 LP, Poly Eval, LMS4 and IIR3 LP, with series "Minimum Sum
WLs" and "Area Optimal WLs".]
Figure 6.10: The improvements in power achieved by non integer word-length optimisation
for power minimisation, compared to the optimisations shown, when un-registered
components are modelled.
Figure 6.10 shows the improvements in dynamic power consumption obtained by
non integer word-length optimisation for power consumption minimisation over the word-
length optimisation procedures for minimisation of the sum of word-lengths and of area.
Compared to the results that contrasted power consumption optimisation and min-
imum sum of word-length optimisation in Section 6.3.1 we see that the much larger range
of component power consumptions has given rise to a much larger gap between the power
consumption of the two optimisations, with power consumption optimisation achieving
almost a 10% improvement for the IIR 3 LP system.
The difference between power consumption optimisation and the power of area
optimisation is also large, particularly for the IIR 3 LP and SOS 2 LP circuits whose
pipelined versions both showed a gap of less than 5% between the power consumption of
the two optimisations in Section 6.3.1. Here we see a difference of greater than 10% for
the four circuits, with improvements of almost 30% for the IIR 3 LP circuit.
These results indicate that in cases where not every output of an arithmetic com-
ponent in a circuit can be pipelined, word-length optimisation for power consumption
minimisation can offer large improvements in power consumption over other available
techniques.
6.4 Conclusion
This chapter has described a novel technique for finding optimal or near-optimal solu-
tions to the problem of integer word-length selection for area and power consumption
optimisation. Methods for ensuring that the correct optimum solutions are found despite
non-convexity in some of the cost and constraint functions were described that have not
been identified in existing work. Tight upper and lower bounds on the costs obtainable
are achieved by solving a relaxed version of the problem where integer word-length con-
straints are removed. The upper bounds for area and power consumption are 2.9% and
5.1% larger than the lower bounds on average, respectively. Heuristics can then improve
on the upper bounds, giving integer word-length solutions whose area and power are 1.4%
and 2.3% larger than the lower bounds on average, respectively.
The first complete set of dynamic power consumption models suitable for quickly
evaluating points in the design space has been used to provide the first set of results
for multiple word-length optimisation for power consumption minimisation. Results show
that, for a specific set of noise constraints, area and power consumption can be improved
by up to 40% over uniform word-length optimisation when integer constraints are relaxed.
The difference between the cost of optimal points of different cost functions for
multiple word-length optimisation has been observed to be small at approximately 2% on
average, mainly due to the exponential increase in quantisation noise with decreases in
word-length. For algorithm implementations where component costs in terms of area and
power are of a similar order the optimisation of the sum of word-lengths generally provides
good results at high speed, and without the need for complex models.
However, when the costs of components in a system differ by orders of magnitude
between cost functions, differences in power consumption between the optimal points of
different cost functions of up to 30% are observed.
Chapter 7
Conclusion
Many problems are faced by designers of high performance implementations of algorithms.
The work presented in this thesis has approached two problems that are currently and
will undoubtedly continue to be amongst the most critical in hardware design: those of
reducing design complexity and of reducing system power consumption.
A novel word-length selection technique has been developed that allows tight
bounds and near-optimal solutions to particular problems to be found. The technique
has been used to compare the optimal points that can be achieved by several different cost
functions, and as a result important insights into the nature of word-length optimisation
problems have been gained.
Word-length optimisation for power consumption minimisation has been performed
for the first time using a set of power consumption models for DSP algorithms implemented
on an FPGA device. These power consumption models allow the fast and accurate es-
timation of the dynamic power consumed within the arithmetic components of a DSP
algorithm, and in the inter-routing wires that connect these components. These models
are able to estimate the power consumed by these algorithm implementations quickly by
trading off a small reduction in their accuracy for a significant increase in the speed at
which they can be evaluated.
It is expected that the reduction in power model accuracy does not significantly
affect the results of the comparison of word-length optimisation cost functions performed
in this thesis. This is because it is shown that orders of magnitude difference between cost
functions are required to bring about large changes in the optimal solutions to these cost
functions for word-length optimisation, due to the exponential relationship between word-
lengths and quantisation noise. Hence the optimal solutions to word-length optimisation
for power consumption minimisation problems should be very close to optimal for the real
circuits.
The results of the work developed in this thesis suggest that future work in the field
of word-length optimisation should be particularly focussed in several areas, as summarised
in the following.
First of all, work in Chapter 6 highlighted that large differences in the power
consumption of different components in a system are necessary to cause significant differences
between the optimal points of word-length optimisation cost functions. Hence the
identification and modelling of computations and algorithms whose component costs differ by
orders of magnitude should be an area of continuing research in the field. Power models
for components whose inputs are not registered are an ideal candidate because, as shown in
Chapter 4, the number of glitches that occur at the output of an unregistered component
can be very large, causing subsequent components to have highly elevated levels of power
consumption. The difficulties of estimating the power consumed in a group of components
that have no registers between them were summarised in Chapter 4 and the problem should
provide an interesting avenue for future work.
Whilst the work in Chapter 6 described a method for calculating tight lower and
upper bounds on the optimal cost of a word-length optimisation problem, a simple heuristic
was used to improve on the upper bound that did not take advantage of the information
available regarding the known lower bound. A new word-length optimisation heuristic
that is able to continue performing word-length changes until the cost of the integer word-
length solution found lies within a chosen percentage of that of the non integer solution
would provide a guaranteed level of optimality for word-length optimisation problems
whilst still finding solutions quickly. Additionally it may be possible to use the lower and
upper bounds found to help prune many points in the design space during a Branch and
Bound search (see [KV05]) for the optimal integer word-length solution. Both of these
possibilities should be explored.
Finally, although the work in this thesis has provided new methods with which
to perform word-length optimisation and to model the power consumption of DSP al-
gorithms implemented on FPGAs, there remain a multitude of word-length optimisation
problems applicable to algorithms in other domains whose accuracy may be measured
differently and for which existing quantisation noise analysis techniques are not applicable
due to the nature of these algorithms. It is essential that future work continues to explore
the possibilities of quickly estimating the quantisation noise introduced into algorithm
implementations for which no current methods exist.
Bibliography
[ACS94] A. Andrade, J. Comba, and J. Stolfi. Affine arithmetic. In Proceedings of
the International Conference on Interval and Computer-Algebraic Methods in
Science and Engineering, 1994.
[AN04] J. H. Anderson and F. N. Najm. Power estimation techniques for FPGAs.
IEEE Transactions on VLSI Systems, 12(10):1015–1027, 2004.
[BB07] S. Bhoj and D. Bhatia. Pre-route interconnect capacitance and power estima-
tion in FPGAs. pages 159–164, 2007.
[BHS98] S. Bobba, I. N. Hajj, and N. R. Shanbhag. Analytical expressions for average
bit statistics of signal lines in DSP architectures. In Proceedings of the IEEE
International Symposium on Circuits and Systems, volume 6, pages 33–36,
1998.
[Bog95] P. T. Boggs. Sequential quadratic programming. Acta Numerica, 4:1–52, 1995.
[Bor87] K. H. Borgwardt. The Simplex Method, A Probabilistic Analysis. Springer-
Verlag, 1987.
[BP00] A. Benedetti and P. Perona. Bit-width optimization for configurable DSPs by
multi-interval analysis. In Proceedings of the Asilomar Conference on Signals
Systems and Computers, volume 1, pages 355–359, 2000.
[BR05] P. Belanovic and M. Rupp. Automated floating-point to fixed-point conversion
with the fixify environment. The IEEE International Workshop on Rapid
System Prototyping, pages 172–178, 2005.
[BT02] J. L. Beuchat and A. Tisserand. Small multiplier-based multiplication and
division operators for Virtex-II devices. In Proceedings of the International
Conference on Field Programmable Logic and Applications, pages 513–522,
2002.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
[CCC07] J. A. Clarke, G. A. Constantinides, and P. Y. K. Cheung. On the feasibility
of early routing capacitance estimation for FPGAs. In Proceedings of the In-
ternational Conference on Field Programmable Logic and Applications, pages
234–239, 2007.
[CCC08] J. A. Clarke, G. A. Constantinides, and P. Y. K. Cheung. Word-length se-
lection for area and power minimization via non-linear optimization. ACM
Trans. Des. Autom. Electron. Syst., (Submitted) 2008.
[CCCS08] J. A. Clarke, G. A. Constantinides, P. Y. K. Cheung, and A. M. Smith. Glitch-
aware output switching activity from word-level statistics. In Proceedings of
the IEEE International Symposium on Circuits and Systems, pages 1792–1795,
2008.
[CCL99] G. A. Constantinides, P. Y. K. Cheung, and W. Luk. Truncation noise in
fixed-point sfgs. IEE Electronics Letters, 35(23):2012–2014, 1999.
[CCL01] G. A. Constantinides, P. Y. K. Cheung, and W. Luk. The multiple wordlength
paradigm. In Proceedings of the IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 51–60, 2001.
[CCL02] G. A. Constantinides, P. Y. K. Cheung, and W. Luk. Optimum wordlength
allocation. In Proceedings of the IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 219–228, 2002.
[CCP06] D. Chen, J. Cong, and P. Pan. FPGA design automation: A survey. Founda-
tions and Trends in Electronic Design Automation, 1(3):139–169, 2006.
[CGC05] J. A. Clarke, A. A. Gaffar, and G. A. Constantinides. Parameterized logic
power consumption models for FPGA-based arithmetic. In Proceedings of
the International Conference on Field Programmable Logic and Applications,
pages 626–629, 2005.
[CGCC06] J. A. Clarke, A. A. Gaffar, G. A. Constantinides, and P. Y. K. Cheung.
Fast word-level power models for synthesis of FPGA-based arithmetic. In
Proceedings of the IEEE International Symposium on Circuits and Systems,
pages 1299–1302, 2006.
[Chu74] K.-L. Chung. A Course in Probability Theory. Academic Press, 1974.
[CNC96] A. Chatterjee, M. Nandakumar, and I.-C. Chen. An investigation of the
impact of technology scaling on power wasted as short-circuit current in low
voltage static CMOS circuits. In Proceedings of the International Symposium
on Low Power Electronics and Design, pages 145–150, 1996.
[Con03] G. A. Constantinides. Perturbation analysis for word-length optimization. In
Proceedings of the IEEE Symposium on Field-Programmable Custom Comput-
ing Machines, pages 81–90, 2003.
[CRS+99] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde, and I. Bolsens. A method-
ology and design environment for DSP ASIC fixed point refinement. In Pro-
ceedings of the IEEE Conference on Design, Automation and Test in Europe,
pages 271–276, 1999.
[CSPL01] M.-A. Cantin, Y. Savaria, D. Prodanos, and P. Lavoie. An automatic word
length determination method. In Proceedings of the IEEE International Sym-
posium on Circuits and Systems, volume 5, pages 53–56, 2001.
[CT06] S. C. Chan and K. M. Tsui. The wordlength determination problem of
linear time invariant systems with multiple outputs - a geometric programming
approach. In Proceedings of the IEEE International Symposium on Circuits and
Systems, pages 5211–5214, 2006.
[CW02] G. A. Constantinides and G. J. Woeginger. The complexity of multiple
wordlength assignment. Applied Mathematics Letters, 15(2):137–140, 2002.
[CW06] N. C. K. Choy and S. J. E. Wilton. Activity-based power estimation and
characterization of DSP and multiplier blocks in FPGAs. In Proceedings of
the IEEE International Conference on Field Programmable Technology, pages
253–256, 2006.
[DT05] V. Degalahal and T. Tuan. Methodology for high level estimation of FPGA
power consumption. In Proceedings of the Asia and South Pacific Design
Automation Conference, volume 1, pages 657–660, 2005.
[EFD+89] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Ricco. Estimate of signal
probability in combinational logic networks. In Proceedings of the European
Test Conference, pages 132–138, 1989.
[FWAW05] M. French, L. Wang, T. Anderson, and M. Wirthlin. Post synthesis level
power modeling of FPGAs. In Proceedings of the IEEE Symposium on Field-
Programmable Custom Computing Machines, pages 281–282, 2005.
[GMLC04] A. A. Gaffar, O. Mencer, W. Luk, and P. Y. K. Cheung. Unifying bit-width
optimisation for fixed-point and floating-point designs. In Proceedings of the
IEEE Symposium on Field-Programmable Custom Computing Machines, pages
79–88, 2004.
[Gra06] Mentor Graphics. ModelSim SE guide. Version 6.2c, 2006.
[HE06] K. Han and B. L. Evans. Optimum wordlength search using sensitivity infor-
mation. EURASIP Journal on Applied Signal Processing, (5):1–14, 2006.
[Hua03] Z. Huang. High-level optimization techniques for low-power multiplier design.
PhD thesis, University of California, Los Angeles, 2003.
[Jac70] L. B. Jackson. On the interaction of roundoff noise and dynamic range in
digital filters. Bell System Technical Journal, 49:159–184, February 1970.
[JTB04] T. Jiang, X. Tang, and P. Banerjee. Macro-models for high level area and
power estimation on FPGAs. In Proceedings of the ACM Great Lakes Sympo-
sium on VLSI, pages 162–165, 2004.
[KAB+03] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J.
Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore’s law meets
static power. Computer, 36(12):68–75, Dec. 2003.
[KB06] P. Kannan and D. Bhatia. Interconnect estimation for FPGAs. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
25(8):1523–1534, 2006.
[KKS98] S. Kim, K.-I. Kum, and W. Sung. Fixed-point optimization utility for C and
C++ based digital signal processing programs. IEEE Transactions on Circuits
and Systems II: Analog and Digital Signal Processing, 45(11):1455–1464, 1998.
[KN00] S. Kotz and K. Nadarajah. Extreme Value Distributions, Theory and Appli-
cations. Imperial College Press, London, United Kingdom, 2000.
[Knu97a] D. E. Knuth. The art of computer programming, volume 1: Fundamental
Algorithms. Addison-Wesley, 1997.
[Knu97b] D. E. Knuth. The art of computer programming, volume 3: Sorting and
Searching. Addison-Wesley, 1997.
[KR06] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. In Pro-
ceedings of the ACM International Symposium on Field Programmable Gate
Arrays, pages 21–30, 2006.
[KT89] B. Krishnamurthy and I. G. Tollis. Improved techniques for estimating signal
probabilities. IEEE Transactions on Computers, 38(7):1041–1045, 1989.
[KV05] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms.
Springer-Verlag, 2005.
[LGC+06] D.-U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Con-
stantinides. Accuracy-guaranteed bit-width optimization. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 25(10):1990–
2000, 2006.
[LLH+05] F. Li, Y. Lin, L. He, D. Chen, and J. Cong. Power modeling and characteristics
of field programmable gate arrays. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 24(11):1712–1724, 2005.
[LR96] P. E. Landman and J. M. Rabaey. Activity-sensitive architectural power anal-
ysis: the dual bit type method. IEEE Transactions on Computer Aided Design
of Integrated Circuits and Systems, 15(6):571–587, 1996.
[Mat6a] The MathWorks. MATLAB and Simulink for Technical Computing. Version
2006a.
[Moo65] G. Moore. Cramming more components onto integrated circuits. Electronics
Magazine, 86(1):82–85, 1965.
[Naj93] F. N. Najm. Transition density: A new measure of activity in digital cir-
cuits. IEEE Transactions on Computer Aided Design of Integrated Circuits
and Systems, 12(2):310–323, 1993.
[NHCB01] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee. Precision and error
analysis of MATLAB applications during automated hardware synthesis for
FPGAs. In Proceedings of the IEEE Conference on Design, Automation and
Test in Europe, pages 722–728, 2001.
[ONG04] E. Ozer, A. P. Nisbet, and D. Gregg. Stochastic bit-width approximation using
extreme value theory for customizable processors. In Proceedings of the
International Conference on Compiler Construction, pages 250–264, 2004.
[OS75] A. V. Oppenheim and R. W. Schafer. Digital signal processing. Prentice-Hall,
1975.
[Pig04] C. Piguet. Low power electronics design. CRC Press, 2004.
[Poo02] K. K. W. Poon. Power estimation for field programmable gate arrays. Master’s
thesis, Department of Electrical and Computer Engineering, University of
British Columbia, August 2002.
[PYW02] K. K. W. Poon, A. Yan, and S. J. E. Wilton. A flexible power model for FP-
GAs. In Proceedings of the International Conference on Field Programmable
Logic and Applications, pages 312–321, 2002.
[RB05] S. Roy and P. Banerjee. An algorithm for trading off quantization error with
hardware resources for MATLAB-based FPGA design. IEEE Transactions on
Computers, 54(7):886–896, 2005.
[RJD98] A. Raghunathan, N. K. Jha, and S. Dey. High-Level Power Analysis and
Optimization. Kluwer Academic Publishers, 1998.
[RP00] K. Roy and S. Prasad. Low-power CMOS VLSI circuit design. Wiley-
Interscience, 2000.
[RSN06] A. Reimer, A. Schulz, and W. Nebel. Modelling macromodules for high-level
dynamic power estimation of FPGA-based digital designs. In Proceedings of
the International Symposium on Low Power Electronics and Design, pages
151–154, 2006.
[SB04] C. Shi and R. W. Brodersen. Automated fixed-point data-type optimization
tool for signal processing and communication systems. In Proceedings of the
ACM Design Automation Conference, pages 478–483, 2004.
[SBA00] M. Stephenson, J. Babb, and S. Amarasinghe. Bitwidth analysis with applica-
tion to silicon compilation. In Proceedings of the ACM SIGPLAN Conference
on Programming Language Design and Implementation, pages 108–120, 2000.
[SJ01] L. Shang and N. K. Jha. High-level power modeling of CPLDs and FPGAs.
In Proceedings of the International Conference on Computer Design, pages
46–53, 2001.
[SK95] W. Sung and K.-I. Kum. Simulation-based word-length optimization method
for fixed-point digital signal processing systems. IEEE Transactions on Signal
Processing, 43(12):3087–3090, 1995.
[SKB02] L. Shang, A. S. Kaviani, and K. Bathala. Dynamic power consumption in
Virtex-II FPGA family. In Proceedings of the ACM International Symposium
on Field Programmable Gate Arrays, pages 157–164, 2002.
[Smi06] A. M. Smith. Heterogeneous Reconfigurable Architecture Design: An
Optimisation Approach. PhD thesis, Department of Electrical and Electronic
Engineering, Imperial College London, 2006.
[SS91] A. S. Sedra and K. C. Smith. Microelectronic circuits. Saunders, 1991.
[Syn06] Synplicity. FPGA Synthesis Tools: Synplify Pro guide. Version 8.8, 2006.
[TCW+05] T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk,
and P. Y. K. Cheung. Reconfigurable computing: architectures and design
methods. IEE Proceedings on Computers and Digital Techniques, 152(2):193–
207, 2005.
[TW01] R. J. Tocci and N. S. Widmer. Digital Systems: Principles and applications.
Prentice Hall, 2001.
[WP98] S. A. Wadekar and A. C. Parker. Accuracy sensitive word-length selection
for algorithm optimization. In Proceedings of the International Conference on
Computer Design, pages 54–61, 1998.
[Xil04] Xilinx Inc., San Jose. Virtex-II Pro and Virtex-II Pro X Platform FPGAs:
Complete Data Sheet. 2004.
[Xil07] Xilinx Inc., San Jose. Xilinx power estimator user guide. 2007.
[Xil08a] Xilinx Inc., San Jose. Xilinx integrated software environment (ISE). 2008.
[Xil08b] Xilinx Inc., San Jose. Xilinx System Generator for DSP guide. 2008.
[Zah03] B. Zahiri. Structured ASICs: opportunities and challenges. In Proceedings of
the International Conference on Computer Design, pages 404–409, 2003.
Appendix A
Diagram of SOS3 system
Figure A.1 shows the block diagram of the SOS3 LP system, whose arithmetic component
power and routing power are estimated at the end of Chapters 4 and 5, respectively.
[Block diagram image not reproduced. Block labels recovered from the figure:]
Input and output: GatewayIn, In Delay (z^-2), Out Delay (z^-1), GatewayOut.
Scaling: Scale Mult1, Scale Mult2, Scale Mult3 (multiplier blocks, (ab)z^-1), with
  Scale Constant1 = 0.001514434814453125
  Scale Constant2 = 0.022430419921875
  Scale Constant3 = 0.2525634765625
Each second-order section BiQuadN (N = 1, 2, 3) contains:
  MultA1, MultA2 (multiplier blocks, (ab)z^-1)
  AddSub0, AddSub1, AddSubB0, AddSubB1 (add/subtract blocks, a + bz^-1)
  FF Delay (z^-1), DelayA1 (z^-3), DelayB1 (z^-2), DelayB2 (z^-3)
  ConstantA1 and ConstantA2, with the values:
    BiQuad1: ConstantA1 = 1.464874267578125, ConstantA2 = -0.540252685546875
    BiQuad2: ConstantA1 = 1.561019897460938, ConstantA2 = -0.6413497924804688
    BiQuad3: ConstantA1 = 1.761245727539063, ConstantA2 = -0.8518905639648438
Figure A.1: The System Generator block diagram representing the system SOS3 LP:
a low-pass IIR filter organised as three second-order sections.
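
The coefficient values recoverable from Figure A.1 allow a quick sanity check of the filter. The following Python sketch is illustrative only and is not part of the thesis tool flow. It makes two assumptions that the figure does not state explicitly: each section has the recursion y[n] = x[n] + 2x[n-1] + x[n-2] + A1 y[n-1] + A2 y[n-2], so that the feed-forward path is (1 + z^-1)^2 and needs no multiplier, and the pipeline delays inside the blocks are ignored. The names SCALES, SECTIONS, NUMERATOR and the three functions are hypothetical. Under these assumptions all poles lie inside the unit circle and the overall DC gain comes out close to one, as would be expected of a scaled low-pass filter.

```python
import cmath

# Scale Constant1..3 and (ConstantA1, ConstantA2) of BiQuad1..3, from Figure A.1.
SCALES = [0.001514434814453125, 0.022430419921875, 0.2525634765625]
SECTIONS = [
    (1.464874267578125, -0.540252685546875),   # BiQuad1
    (1.561019897460938, -0.6413497924804688),  # BiQuad2
    (1.761245727539063, -0.8518905639648438),  # BiQuad3
]

# ASSUMPTION: feed-forward coefficients (1, 2, 1); the figure shows no
# B constants, only delays and adders on the feed-forward path.
NUMERATOR = (1.0, 2.0, 1.0)


def section_poles(a1, a2):
    """Roots of z^2 - a1*z - a2 = 0 (assumed feedback sign convention)."""
    root = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + root) / 2.0, (a1 - root) / 2.0


def max_pole_radius():
    """Largest pole magnitude over all three sections."""
    return max(abs(p) for a1, a2 in SECTIONS for p in section_poles(a1, a2))


def cascade_dc_gain():
    """Gain at z = 1: product of scale constants and per-section gains."""
    gain = 1.0
    for scale in SCALES:
        gain *= scale
    for a1, a2 in SECTIONS:
        gain *= sum(NUMERATOR) / (1.0 - a1 - a2)
    return gain


if __name__ == "__main__":
    print("largest pole radius:", max_pole_radius())
    print("DC gain of cascade: ", cascade_dc_gain())
```

With the opposite sign convention for the feedback coefficients the sections would have poles outside the unit circle, which is why the convention above is the one assumed here.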
